Understanding and Managing Model Drift in Machine Learning

What is Model Drift?

Rafał Łagowski
10 min read · May 3, 2023


Model drift refers to the decline in model performance due to changes in data and relationships between input and output variables. This drift can lead to a decrease in model quality in a production environment, resulting in inaccurate and suboptimal forecasts. This can negatively impact a company, potentially causing significant losses. Therefore, it is essential to monitor model degradation to give data scientists and business owners advance warning of a decline in model performance, allowing them time to:

  • Correct data drift
  • Retrain the model with new data
  • Update the model with new data source variables
  • Change the usage policy (altering cutoff thresholds or adding new rules for when and how to use the model)

Questions to Answer

To determine whether a model is still functioning correctly, we need to monitor it, and we need more than just the business metrics that led us to choose the model in the first place. To stay in control of our AI projects, we should be proactive about potential threats and continuously monitor every project deployed in a production environment.

Thus, we should answer the following questions:

  1. Does the incoming data reflect the same patterns as the data on which the model was built?
  2. Is the model performing as well as it did during the design phase?
  3. If not, why not?

Before answering these questions, let’s examine the different types of data drift.

Types of Model Drift

There are several types of drift to be aware of. To better understand them, let’s use an example. Assume that a year ago, we built a model for detecting fraudulent credit card transactions. At the time of deployment, we were satisfied with its metrics.

Now, one year later, a lot could have changed. Let’s examine the various situations that could arise:

Concept Drift: This type of drift occurs when the rules and relationships between input and output data change. For instance, fraudsters may change their methods, making fraudulent transactions appear more like normal ones to deceive algorithms and humans.

Data Drift: This occurs when incoming data changes, but the decision boundary remains valid. Data drift can be further divided into two categories:

  • Label Drift: The output values change, shifting the distribution of the target variable (e.g., an increase in the number of fraud cases).
  • Feature Drift: The input values change, shifting the distribution of the input features (e.g., a promotion increases the percentage of fast transfers).

While these changes can affect the quality of the incoming data, they do not always degrade the model's metrics: the inputs may shift, but if observations still fall on the correct side of the decision boundary, the model will continue to function correctly.
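
To make this concrete, here is a minimal, self-contained sketch (simulated data, not the article's fraud model) in which one input feature drifts substantially while accuracy barely moves, because the drifting feature does not drive the label:

```python
# A toy illustration: feature x1 drifts, but the label depends only on x0,
# so the decision boundary learned on the reference data still holds.
import numpy as np
from scipy.stats import ks_2samp
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(42)

def make_data(n, x1_mean):
    # The label depends only on the sign of x0; x1 is irrelevant noise.
    X = np.column_stack([rng.normal(0, 1, n), rng.normal(x1_mean, 1, n)])
    y = (X[:, 0] > 0).astype(int)
    return X, y

X_ref, y_ref = make_data(5_000, x1_mean=0.0)   # reference (construction) sample
X_new, y_new = make_data(5_000, x1_mean=2.0)   # production sample: x1 has drifted

model = LogisticRegression().fit(X_ref, y_ref)

print("accuracy on reference:", accuracy_score(y_ref, model.predict(X_ref)))
print("accuracy after drift: ", accuracy_score(y_new, model.predict(X_new)))
print("KS test on x1:", ks_2samp(X_ref[:, 1], X_new[:, 1]))  # flags drift even though accuracy holds
```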


Dealing with Model Drift

Managing model drift is a crucial aspect of maintaining the performance of machine learning models. Here are several approaches that can help:

  1. Monitoring drift: Systematically monitoring input and output data, as well as model quality metrics, can help detect drift quickly. Using appropriate tools and techniques, such as descriptive statistics, visualizations, statistical tests, and alert systems, can facilitate drift detection.
  2. Adaptive models: Using models that can adapt to changing data can help maintain high performance in the face of drift. Techniques such as online learning, transfer learning, and ensemble learning can be useful in keeping models up to date (see the sketch after this list).
  3. Lifecycle management and planning: Implementing a model lifecycle management strategy that considers evaluation, updating, and eventual retirement of the model can ease the process of maintaining performance. Anticipating that models will need to be updated or replaced can reduce the risk of model drift occurring.
  4. Understanding the business context: Even if a machine learning model demonstrates high accuracy, it is important to understand how its results impact the organization. Ensuring that the model still delivers business value, even in the presence of drift, is crucial.
  5. Training and education: Educating the team on techniques for monitoring and managing model drift can be key to maintaining high-quality predictions. Organizing workshops, trainings, and meetings to share knowledge and experiences can help the team be prepared for potential issues related to model drift.
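
As an illustration of point 2, here is a minimal sketch of online learning with scikit-learn's partial_fit, assuming labelled batches keep arriving from production (the stream below is simulated):

```python
# Incrementally updating a linear model as new labelled batches arrive.
import numpy as np
from sklearn.linear_model import SGDClassifier

rng = np.random.default_rng(0)
model = SGDClassifier(loss="log_loss", random_state=0)  # use loss="log" on scikit-learn < 1.1
classes = np.array([0, 1])

for step in range(10):
    # In production each batch would come from the live stream once labels are known.
    X_batch = rng.normal(loc=0.1 * step, scale=1.0, size=(500, 5))  # slowly drifting inputs
    y_batch = (X_batch.sum(axis=1) > 0.5 * step).astype(int)        # drifting relationship
    model.partial_fit(X_batch, y_batch, classes=classes)            # incremental update, no full retrain
```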

Model drift is a significant factor affecting the performance of machine learning models over time. Understanding the different types of drift and monitoring changes in data and model quality can help maintain high-quality predictions. Implementing strategies to manage model drift, such as adaptive models, model lifecycle management, and team education, can minimize the negative effects of model drift on the organization.

Model Drift: Practical Examples

Let’s summarize the different types of drift using a simple example:

  • Concept drift: Fraudulent transactions initially involved currency transfers. Now, criminals have shifted to instant transfers, which were not present in the sample used to build the model.
  • Label drift: A higher proportion of fraudulent transactions (as more criminal groups are using the same methods as when the model was built).
  • Feature drift: A higher proportion of customers using VISA (e.g., due to a Mastercard outage that day).

Model Drift: Rate of Occurrence

Instead of wondering if our first model will become outdated, we should focus on WHEN it will happen. One of the key issues is the speed at which drift occurs. The changes may vary significantly and could be:

  • Sudden: Overnight changes, such as lockdowns due to the COVID-19 pandemic or the outbreak of war in Ukraine, can cause significant shifts.
  • Gradual or cumulative: When the value of a feature changes slowly, it might not be obvious that a significant problem is on the horizon. Small changes over an extended period may go unnoticed, especially if appropriate warning levels are not in place.
  • Impulsive and one-time: For example, when data is incorrectly fed into the system.

Some also distinguish seasonal or recurring drift, such as Black Friday sales or national holiday sales at a supermarket chain. However, seasonality is a known concept in modeling, and although it may look like data drift in the short term, we can predict recurring changes in the long run.

Detecting Model Drift

There are two approaches to detecting model drift, depending on whether we have access to target values for our predicted data:

  • Ground Truth-Based Approach: Ground truth refers to the correct answer to the question our model is trying to solve. In the case of credit card transactions, it would be whether a transaction is genuinely fraudulent.

When we know the values for all predictions, we can accurately assess our model’s performance and compare it to the model’s performance during the design phase. This is the best way to monitor a model: simply calculate business metrics and obtain an answer.
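
A minimal sketch of this idea (hypothetical file path, column names, and alerting policy): compute the agreed business metric per time window and compare it with the level measured during the design phase.

```python
# Ground-truth monitoring: compare monthly AUC against the design-phase baseline.
import pandas as pd
from sklearn.metrics import roc_auc_score

BASELINE_AUC = 0.92      # value measured during the design phase (assumed)
MAX_DROP = 0.05          # alerting policy (assumed)

# Assumed layout: one row per scored transaction with its eventual true label.
df = pd.read_parquet("scored_transactions.parquet")  # columns: timestamp, score, is_fraud

for month, window in df.groupby(df["timestamp"].dt.to_period("M")):
    if window["is_fraud"].nunique() < 2:
        continue  # AUC is undefined when only one class is present
    auc = roc_auc_score(window["is_fraud"], window["score"])
    if BASELINE_AUC - auc > MAX_DROP:
        print(f"{month}: AUC dropped to {auc:.3f}, investigate")
```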

More advanced supervised learning methods can also be employed, such as:

  • Sequence analysis
  • Statistical process control (SPC) for assessing the pace of change
  • Adaptive windowing (ADWIN) for more precise detection of distribution changes over time

Note that it is essential to discuss specific metrics with the business.

The biggest challenge in monitoring ground truth is acquiring the truth. This may require:

a) Waiting for an event (e.g., if our model predicts a one-year horizon)

b) Engaging individuals to manually label data

For transaction fraud detection, we need time for a customer to file a complaint and an analyst to investigate the case and confirm the fraud. However, no one wants to deploy a model in production and wait for a year with their eyes closed. This is where the second approach comes in.

  • Input Drift-Based Approach: When ground truth is available, it is the best way to monitor a model in production. However, obtaining ground truth can be slow and costly.

If our use case requires quick feedback or ground truth is not yet available, evaluating input drift may be a good alternative. The model's performance is a reflection of the data used to train it, so a shift in the distribution of the input data can translate into degraded predictions.

We assume that the model will make accurate predictions when the data it was built on reflects the data it currently processes.

If there are significant differences between the data used to build the model and the data it now processes, the model’s performance is likely at risk.

Unlike ground truth monitoring, this approach needs only the input data. We should focus on how the real-world data distributions differ from the model's training data.

We have several methods for monitoring model drift, such as:

  • Population Stability Index (PSI): The most popular metric in the financial industry; it bins a variable's values and compares the share of observations per bin between the reference and current samples (see the sketch after this list).
  • Kullback-Leibler divergence: Measures how much the current distribution diverges from the one observed at model construction; it is asymmetric and can be infinite.
  • Jensen-Shannon divergence: Based on KL divergence, it is symmetric, measures the similarity between two probability distributions, and always has a finite value.
  • Kolmogorov-Smirnov test (KS test): Quantifies the maximum distance between the empirical cumulative distribution functions of two samples; being nonparametric, it is useful when the distributions are not normal.
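
Below is a minimal sketch of three of these checks (PSI, the KS test, and Jensen-Shannon distance) on a single numeric feature, using simulated data; the 0.25 PSI cutoff in the comment is a common rule of thumb rather than a universal standard.

```python
# Input-drift checks on one numeric feature: PSI, KS test, and Jensen-Shannon distance.
import numpy as np
from scipy.spatial.distance import jensenshannon
from scipy.stats import ks_2samp

def psi(expected, actual, n_bins=10):
    """Population Stability Index between the reference and the current sample."""
    # Bin edges come from the reference (model-construction) sample.
    edges = np.quantile(expected, np.linspace(0, 1, n_bins + 1))
    e_pct = np.histogram(expected, bins=edges)[0] / len(expected)
    a_pct = np.histogram(np.clip(actual, edges[0], edges[-1]), bins=edges)[0] / len(actual)
    e_pct, a_pct = np.clip(e_pct, 1e-6, None), np.clip(a_pct, 1e-6, None)  # avoid log(0)
    return float(np.sum((a_pct - e_pct) * np.log(a_pct / e_pct)))

def hist(x):
    # Shared binning so the two distributions are directly comparable.
    return np.histogram(x, bins=20, range=(-5, 5), density=True)[0]

rng = np.random.default_rng(1)
reference = rng.normal(0.0, 1.0, 10_000)   # distribution at model-construction time
current = rng.normal(0.3, 1.2, 10_000)     # drifted production distribution

print("PSI:", psi(reference, current))        # > 0.25 is often read as serious drift
print("KS :", ks_2samp(reference, current))   # tiny p-value: the distributions differ
print("JS distance:", jensenshannon(hist(reference), hist(current)))
```

In practice, you would run such checks for every model feature on each scoring batch, log the results, and alert when agreed thresholds are exceeded.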

Additionally, more advanced unsupervised learning methods can be employed, such as MD3, FAAD, UDetect, DDAL, SAND, etc. If you’re interested in this topic, you can check out a compilation HERE.

Model Drift: How to Analyze the Cause?

Great! By applying the above methods, we've discovered that something is off with our data. Now we need to trace back to the root cause.

There are two main possible causes:

Data integrity issue

This is the first place to check if anything has changed. There might be an error or a “bug” in the pipelines that feed our data, or the API may have changed and now returns empty values.

It is essential to verify this before delving deeper into the problem search. We want to be alerted to anything non-standard: sudden gaps in data values, new values outside the original range, or perhaps a new product catalog introduced by our company.
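
A minimal sketch of such integrity checks, with hypothetical file paths, column layout, and tolerances:

```python
# Basic data-integrity checks to run before blaming "real" drift:
# missing columns, missing values, range violations, and unseen categories.
import pandas as pd

reference = pd.read_parquet("training_sample.parquet")  # data used to build the model (assumed path)
current = pd.read_parquet("latest_batch.parquet")       # freshly scored production batch (assumed path)

for col in reference.columns:
    if col not in current.columns:
        print(f"{col}: missing from the current batch")
        continue

    null_rate = current[col].isna().mean()
    if null_rate > reference[col].isna().mean() + 0.05:  # assumed tolerance
        print(f"{col}: null rate jumped to {null_rate:.1%}")

    if pd.api.types.is_numeric_dtype(reference[col]):
        lo, hi = reference[col].min(), reference[col].max()
        out_of_range = ((current[col] < lo) | (current[col] > hi)).mean()
        if out_of_range > 0.01:                          # assumed tolerance
            print(f"{col}: {out_of_range:.1%} of values outside the original range [{lo}, {hi}]")
    else:
        new_values = set(current[col].dropna()) - set(reference[col].dropna())
        if new_values:
            print(f"{col}: unseen categories {sorted(new_values)[:5]}")
```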

Real label or feature change

Once we are sure there is no data integrity issue, we can delve deeper and examine drift analytics. First and foremost, we compare distributions at the level of individual variables and look for discrepancies relative to the construction sample.

Which distributions for which variables have changed? Are these changes in features that most affect the model’s predictions? It is worth looking at features from a time perspective (e.g., in monthly or weekly windows) and how their distributions have changed.
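
A minimal sketch of that time-window view (hypothetical feature and column names, reusing the psi() helper from the earlier sketch):

```python
# Track per-feature drift in monthly windows against the construction sample.
import pandas as pd

features = ["amount", "transfer_speed", "merchant_risk"]   # assumed feature names
reference = pd.read_parquet("training_sample.parquet")     # assumed paths
current = pd.read_parquet("scored_transactions.parquet")   # includes a timestamp column

monthly_psi = (
    current
    .assign(month=current["timestamp"].dt.to_period("M"))
    .groupby("month")
    .apply(lambda window: pd.Series({f: psi(reference[f], window[f]) for f in features}))
)
print(monthly_psi.round(3))  # rows: months, columns: features; high values point to the culprits
```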

At this stage, it is worth collaborating with the model creators (if you didn’t create it yourself), a data analyst, or a domain expert who can help understand the differences, e.g., whether it was a slow change or a sudden one.

You can also prepare analyses for different segments (e.g., customer age, sales channel, region). There is a high chance that this will lead you in the right direction.

How to fix the model?

It depends on the findings above.

In my experience, the most common case was repairing a pipeline whose data had changed scope because of another project (e.g., values that used to be empty suddenly arrived as zeros) or correcting a faulty data delivery (an error email, a re-delivery, and everything returned to normal).

For a genuinely deteriorating model, it may be worth retraining on fresh data, using a new sample with a shifted time window.

Sometimes you may find that switching between models is necessary depending on recurring cycles (if the base model did not learn them by itself).

You can look for segments where the model performs worse and perhaps slightly adjust the cutoff threshold for them.
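
A minimal sketch of such a per-segment cutoff adjustment (hypothetical column names and a made-up precision target):

```python
# For each segment, choose the lowest score threshold that still meets a precision target.
import numpy as np
import pandas as pd
from sklearn.metrics import precision_recall_curve

TARGET_PRECISION = 0.90  # assumed business requirement

df = pd.read_parquet("scored_transactions.parquet")  # columns: segment, score, is_fraud

for segment, part in df.groupby("segment"):
    precision, recall, thresholds = precision_recall_curve(part["is_fraud"], part["score"])
    ok = np.where(precision[:-1] >= TARGET_PRECISION)[0]
    if len(ok) == 0:
        print(f"{segment}: target precision not reachable, keep the global cutoff")
        continue
    cutoff = thresholds[ok[0]]  # lowest threshold that still meets the precision target
    print(f"{segment}: cutoff={cutoff:.3f}, recall at that point={recall[ok[0]]:.2f}")
```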

There are many methods and ways to address this issue, and one could easily write a dissertation on this topic.

Oh, one more thing: ready-to-use libraries with solutions

I remember when I dealt with credit risk, I wrote such solutions using VBA and Excel. Fortunately, today there are many Python packages that provide ready-to-use solutions, and you don’t have to write everything yourself.

Here are a few example packages that I had the opportunity to use for detecting model drift:

  1. Evidently (a quick example follows below)
  2. Deepchecks
  3. NannyML
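
For instance, with Evidently a per-feature data drift report takes only a few lines; the API sketched below matches the 0.4.x releases and may differ in newer versions:

```python
# A hedged sketch of generating a data drift report with Evidently (API as of ~0.4.x).
import pandas as pd
from evidently.report import Report
from evidently.metric_preset import DataDriftPreset

reference = pd.read_parquet("training_sample.parquet")  # assumed paths
current = pd.read_parquet("latest_batch.parquet")

report = Report(metrics=[DataDriftPreset()])
report.run(reference_data=reference, current_data=current)
report.save_html("data_drift_report.html")              # per-feature drift tests and visualizations
```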

Thanks to Mirosław Mamczur
