Data Drift
May 20, 2023
AI and machine learning have revolutionized the way we solve problems and make decisions. They have enabled us to extract valuable insights from vast amounts of data and automate tasks that were once time-consuming and error-prone. However, as with any technology, they have their limitations and challenges. One of the most pressing challenges is data drift.
What is Data Drift?
Data drift refers to the phenomenon where the statistical properties of the data used to train a machine learning model change over time, rendering the model less accurate or even useless. The term “drift” implies a gradual and subtle change, rather than a sudden and drastic one. Data drift can occur for various reasons, such as:
- Changes in the underlying population: the people or things that the data represents may change over time, leading to different patterns and distributions. For example, a model that predicts loan defaults based on historical data may become obsolete if the demographics or economic conditions of the borrowers change.
- Changes in the data-generating process: the way that the data is collected, processed, or stored may change over time, leading to different biases, errors, or missing values. For example, a model that classifies images based on a certain resolution or quality may fail if the images become compressed or distorted.
- Changes in the context or environment: the external factors that affect the data or the model may change over time, leading to different correlations, dependencies, or interactions. For example, a model that predicts traffic congestion based on weather and time of day may fail if a new highway is built or a major event occurs.
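One way to catch shifts like those listed above is to compare the distribution of a feature at training time against its distribution in production. The sketch below does this with the Population Stability Index (PSI) on synthetic income data; the 10-bin layout, the data, and the common rule of thumb that a PSI above roughly 0.2 signals notable drift are all illustrative choices, not universal constants.

```python
# Minimal sketch: measuring distributional drift between a reference
# (training-time) sample and a current (production-time) sample with the
# Population Stability Index (PSI). All data here is synthetic.
import math
import random

def psi(reference, current, n_bins=10):
    """Population Stability Index between two 1-D numeric samples."""
    lo, hi = min(reference), max(reference)
    width = (hi - lo) / n_bins or 1.0

    def bin_fractions(sample):
        counts = [0] * n_bins
        for x in sample:
            idx = int((x - lo) / width)
            idx = min(max(idx, 0), n_bins - 1)  # clamp out-of-range values
            counts[idx] += 1
        # Floor each fraction so the log term stays finite for empty bins.
        return [max(c / len(sample), 1e-6) for c in counts]

    ref = bin_fractions(reference)
    cur = bin_fractions(current)
    return sum((c - r) * math.log(c / r) for r, c in zip(ref, cur))

random.seed(0)
train_income = [random.gauss(50_000, 10_000) for _ in range(5_000)]
same_income = [random.gauss(50_000, 10_000) for _ in range(5_000)]
shifted_income = [random.gauss(65_000, 15_000) for _ in range(5_000)]

print(f"PSI, same distribution:    {psi(train_income, same_income):.3f}")
print(f"PSI, shifted distribution: {psi(train_income, shifted_income):.3f}")
```

A sample drawn from the training distribution scores near zero, while a shifted population scores well above the 0.2 rule of thumb.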
Data drift is a common issue in real-world machine learning applications, especially those that involve streaming data, online learning, or dynamic environments. It can lead to a loss of predictive power, an increase in false positives or false negatives, and a decrease in user trust and satisfaction.
Why is Data Drift a Challenge?
Data drift is a challenge for several reasons:
- It undermines the validity of the machine learning models: if the data that the models are based on no longer reflects reality, then the models are no longer reliable or accurate. This can lead to wrong decisions, missed opportunities, or even harm to people or organizations.
- It requires constant monitoring and maintenance: to detect and mitigate data drift, machine learning models need to be continuously monitored and updated. This requires significant resources and expertise, and may be impractical or impossible in some cases.
- It raises ethical and legal concerns: if the machine learning models are used for sensitive or critical tasks, such as healthcare, finance, or justice, then data drift may have serious ethical and legal implications. It may lead to discrimination, bias, or unfairness, and violate privacy or security regulations.
- It highlights the importance of data governance: to avoid data drift, machine learning models need to be built on high-quality data that is well-documented, curated, and governed. This requires a robust data infrastructure, policies, and procedures, and a culture of data literacy and accountability.
How to Address Data Drift?
To address data drift, machine learning practitioners have developed various techniques and approaches. Some of the most common ones are:
- Retraining: periodically retraining the machine learning model with fresh data that reflects the current reality. This ensures that the model stays up-to-date and adapts to new patterns and distributions. However, retraining can be expensive and time-consuming, and may require additional labeled data or human expertise.
- Monitoring: continuously monitoring the performance of the machine learning model and the quality of the incoming data. This can help detect data drift early on and trigger alerts or actions. However, monitoring can also produce false positives or negatives, and may require a trade-off between accuracy and latency.
- Adapting: dynamically adapting the machine learning model to the changing data by updating its parameters, features, or algorithms. This can help mitigate data drift without necessarily retraining the model from scratch. However, adapting can also introduce biases or overfitting, and may require careful tuning and validation.
- Ensembling: combining multiple machine learning models that are trained on different versions of the data, or using ensemble methods that are inherently robust to data drift. This can help improve the accuracy and robustness of the machine learning system, especially in cases where retraining, monitoring, or adapting are not feasible or effective. However, ensembling can also increase the complexity and computational cost of the system, and may require additional data or model selection.
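As a concrete illustration of the monitoring approach, here is a minimal sketch that tracks accuracy over a sliding window of recent labeled predictions and raises an alert when it falls below a threshold. The window size, threshold, and simulated drift are all illustrative; a real system tunes these to trade off sensitivity against false alarms.

```python
# Minimal sketch of performance monitoring with a sliding accuracy window.
import random
from collections import deque

class AccuracyMonitor:
    """Track accuracy over a sliding window of labeled outcomes."""

    def __init__(self, window_size=50, threshold=0.8):
        self.window = deque(maxlen=window_size)
        self.threshold = threshold

    def record(self, prediction, actual):
        """Record one outcome; return True if an alert should fire."""
        self.window.append(prediction == actual)
        if len(self.window) < self.window.maxlen:
            return False  # not enough data to judge yet
        accuracy = sum(self.window) / len(self.window)
        return accuracy < self.threshold

monitor = AccuracyMonitor(window_size=50, threshold=0.8)

# Simulate a model that is 90% accurate, then degrades to 60% once drift
# sets in at step 200.
random.seed(1)
alert_step = None
for step in range(400):
    p_correct = 0.9 if step < 200 else 0.6
    correct = random.random() < p_correct
    if monitor.record(1, 1 if correct else 0) and alert_step is None:
        alert_step = step

print(f"first alert fired at step {alert_step}")
```

Note that this assumes labels arrive promptly; when ground truth is delayed, monitors often fall back on input-distribution checks like the PSI instead.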
Each of these techniques has its strengths and weaknesses, and their effectiveness depends on the specific context and requirements of the machine learning problem. Therefore, it is important to carefully assess and compare them, and to choose the most suitable ones based on empirical evidence, domain knowledge, and stakeholder input.
Examples of Data Drift
To illustrate the concept of data drift and its impact on machine learning models, let us consider two examples:
Example 1: Image Recognition
Suppose we want to build a machine learning model that can recognize dogs and cats in images. We collect a dataset of 10,000 images of dogs and cats, labeled as such. We split the dataset into a training set of 9,000 images and a validation set of 1,000 images. We train a convolutional neural network (CNN) on the training set and evaluate its accuracy on the validation set. We achieve an accuracy of 90%, which we consider good enough to deploy the model in the wild.
However, after a few months, we notice that the model’s accuracy has dropped to 70%. We investigate and find that the images the model now sees in production are no longer representative of the images it was trained and validated on. Specifically, we find that:
- There are many new breeds of dogs and cats that were not in the original dataset, and some of them look very different from the ones in the original dataset.
- There are many images that contain both dogs and cats, or other animals or objects, which were not included in the original dataset.
- There are many images that have different resolutions, lighting conditions, or backgrounds, which were not accounted for in the original dataset.
To address this data drift, we can use one or more of the techniques mentioned earlier. For example, we can:
- Retrain the model with new images that reflect the current reality, or use transfer learning to adapt pre-trained models to the new images.
- Monitor the model’s accuracy and confusion matrix on a regular basis, and use threshold-based or anomaly-based alerts to trigger retraining or updating.
- Adapt the model’s architecture or hyperparameters to handle new breeds, mixed images, or diverse conditions, or use data augmentation techniques to generate synthetic images.
- Ensemble multiple models that are trained on different subsets or versions of the data, or use techniques such as bagging, boosting, or stacking to combine them.
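As a toy illustration of the ensembling idea, the sketch below fits one deliberately trivial "model" per time window (a 1-D threshold rule) and combines them by majority vote. The windows, the drifting class means, and the threshold rule itself are all invented so that the combination logic, not the learners, stays in focus.

```python
# Minimal sketch of ensembling against drift: one simple model per time
# window, combined by majority vote. All data here is synthetic.
import random

def fit_threshold(points):
    """Fit a 1-D rule: predict 1 when x exceeds the midpoint of class means."""
    pos = [x for x, y in points if y == 1]
    neg = [x for x, y in points if y == 0]
    return (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2

def majority_vote(thresholds, x):
    votes = sum(1 for t in thresholds if x > t)
    return 1 if votes > len(thresholds) / 2 else 0

random.seed(2)

def make_window(pos_mean):
    """Synthetic labeled window; the positive class mean drifts over time."""
    data = [(random.gauss(pos_mean, 1.0), 1) for _ in range(200)]
    data += [(random.gauss(0.0, 1.0), 0) for _ in range(200)]
    return data

# Three windows whose positive class drifts from mean 3.0 down to 2.0.
windows = [make_window(m) for m in (3.0, 2.5, 2.0)]
thresholds = [fit_threshold(w) for w in windows]

# Evaluate the vote on data resembling the newest window.
test = make_window(2.0)
correct = sum(1 for x, y in test if majority_vote(thresholds, x) == y)
print(f"ensemble accuracy on recent data: {correct / len(test):.2f}")
```

Because the vote is dominated by the median threshold, the ensemble tracks the drifting class reasonably well even though two of its three members were fit on stale windows.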
Example 2: Credit Scoring
Suppose we want to build a machine learning model that can predict the likelihood of a customer defaulting on a loan. We collect a dataset of 100,000 customers, labeled as either default or non-default, and a set of features that are relevant to credit scoring, such as income, age, education, employment, and credit history. We split the dataset into a training set of 90,000 customers and a validation set of 10,000 customers. We train a logistic regression model on the training set and evaluate its accuracy on the validation set. We achieve an accuracy of 80%, which we consider good enough to use the model for loan decisions.
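The setup described above can be sketched end to end in a few lines. Everything here is invented for illustration: the two standardized features, the coefficients in the synthetic ground truth, and the data itself. A real credit model would use far richer features and a vetted library rather than hand-rolled gradient descent.

```python
# Minimal sketch of a credit-scoring model: logistic regression trained
# by plain stochastic gradient descent on synthetic customer data.
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_logistic(data, lr=0.1, epochs=60):
    """data: list of (feature_vector, label); SGD, no regularization."""
    n_features = len(data[0][0])
    w, b = [0.0] * n_features, 0.0
    for _ in range(epochs):
        for x, y in data:
            p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
            err = p - y
            w = [wi - lr * err * xi for wi, xi in zip(w, x)]
            b -= lr * err
    return w, b

def predict(w, b, x):
    return 1 if sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) >= 0.5 else 0

random.seed(3)

def make_customer():
    income = random.gauss(0.0, 1.0)    # standardized income
    history = random.gauss(0.0, 1.0)   # standardized credit-history length
    # Assumed ground truth: more income and longer history lower default risk.
    p_default = sigmoid(-1.5 * income - 1.0 * history)
    return ([income, history], 1 if random.random() < p_default else 0)

data = [make_customer() for _ in range(5_000)]
train, val = data[:4_000], data[4_000:]
w, b = train_logistic(train)

accuracy = sum(1 for x, y in val if predict(w, b, x) == y) / len(val)
print(f"validation accuracy: {accuracy:.2f}")
```

On this synthetic population the learned coefficients approach the assumed ground truth, and validation accuracy is limited mainly by the irreducible noise in the labels.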
However, after a few years, we notice that the model’s accuracy has dropped to 60%. We investigate and find that the customers the model now scores are no longer representative of those in the original dataset. Specifically, we find that:
- There are many new types of loans, such as peer-to-peer loans, crowdfunding loans, or microloans, which were not in the original dataset, and have different risk profiles or default rates.
- There are many new types of customers, such as gig workers, freelancers, or immigrants, who have different income sources, employment status, or credit histories, which were not in the original dataset, and have different risk profiles or default rates.
- There are many changes in the economic conditions, such as interest rates, inflation, or unemployment, which affect the customers’ ability to repay the loans, and were not accounted for in the original dataset.
To address this data drift, we can use one or more of the techniques mentioned earlier. For example, we can:
- Retrain the model with new data that reflects the current reality, or use feature selection or engineering to handle new types of loans or customers.
- Monitor the model’s accuracy and confusion matrix on a regular basis, and use feedback loops or active learning to update the model’s predictions or decisions.
- Adapt the model’s decision threshold or objective function to balance the trade-off between false positives and false negatives, or use techniques such as cost-sensitive learning or fairness constraints to mitigate bias or discrimination.
- Ensemble multiple models that are trained on different subsets or versions of the data, or use techniques such as Bayesian inference or causal inference to account for the uncertainty or causality of the data.
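To make the threshold-adaptation idea concrete, the sketch below picks the decision threshold that minimizes total expected cost on labeled data, under the assumption that a missed default (false negative) costs five times as much as wrongly rejecting a good customer (false positive). The cost ratio, the score distributions, and the 50% default rate are all invented for illustration.

```python
# Minimal sketch of cost-sensitive threshold adaptation: choose the
# cutoff on predicted default scores that minimizes total cost under
# asymmetric (and here, assumed) error costs.
import random

def expected_cost(scores, labels, threshold, fp_cost=1.0, fn_cost=5.0):
    """Total cost of thresholding scores, with asymmetric error costs."""
    cost = 0.0
    for s, y in zip(scores, labels):
        predicted = 1 if s >= threshold else 0
        if predicted == 1 and y == 0:
            cost += fp_cost      # wrongly rejected a good customer
        elif predicted == 0 and y == 1:
            cost += fn_cost      # missed an actual default
    return cost

def best_threshold(scores, labels):
    candidates = [i / 100 for i in range(1, 100)]
    return min(candidates, key=lambda t: expected_cost(scores, labels, t))

random.seed(4)
# Synthetic scores: defaulters (label 1) tend to score higher.
labels = [1 if random.random() < 0.5 else 0 for _ in range(2_000)]
scores = [
    min(max(random.gauss(0.7 if y else 0.3, 0.2), 0.0), 1.0)
    for y in labels
]

t = best_threshold(scores, labels)
print(f"cost-minimizing threshold: {t:.2f}")
print(f"cost at threshold 0.50:    {expected_cost(scores, labels, 0.5):.0f}")
print(f"cost at best threshold:    {expected_cost(scores, labels, t):.0f}")
```

Because false negatives are penalized more heavily, the chosen threshold sits below the default 0.5, flagging more borderline customers as risky; the same search can be rerun as costs or score distributions drift.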
Data drift is a pervasive and challenging issue in machine learning, but it is also an opportunity for innovation and improvement. By understanding the causes and consequences of data drift, and by using appropriate techniques and approaches, we can build machine learning systems that are more accurate, robust, and trustworthy. However, data drift also requires us to rethink our data governance practices, and to ensure that our models are built on high-quality data that reflects the diversity and dynamics of the real world.