Machine learning gives us powerful tools to answer business questions in the blink of an eye, but those answers can be deceptive. As enterprise AI continues to evolve, ML practitioners must adapt and find creative ways to maintain the quality of their ML models.
Businesses that rely on ML and AI to make decisions have certain expectations, including accuracy, reliability, fairness, and other metrics relevant to business needs. But AI itself is a fairly new branch of computer science, and the path to meeting those expectations is not always clear.
The democratization of AI and easy access to ML tools created a booming industry that has produced many successful startups and products. From fashion and mental health support to climate change and the food we eat, there are many areas where artificial intelligence has proved useful.
Although the promise of AI could ultimately help humanity solve the difficult problems we are facing, we should not put blind faith in this technology (yet). There are many examples where AI has failed miserably.
The IBM Watson oncology platform is one example. Its promise was to deliver state-of-the-art tools for identifying various cancers and appropriate treatments. It was used by physicians all over the world, but to little effect: the system misdiagnosed conditions and prescribed incorrect medications and treatments, endangering patients whenever its recommendations were accepted without physician supervision. The project was shut down soon after launch.
The lesson here is that AI is not yet mature enough to be put in front of complicated, messy problems in the hope that it will figure them out on its own. The difficulty in creating successful AI products is that they are very complicated and require multiple teams of engineers, data scientists, and researchers to build an initial AI system, and the work doesn't stop there.
At the heart of every AI system is an ML model, and ML models tend to deteriorate over time. As model performance decreases, the model produces faulty predictions. Without ML monitoring in place, relying on the model's output amounts to rolling the dice and gambling that it is correct.
In most real-world application scenarios, the machine learning model’s performance deteriorates in production and consistently degrades as the systems evolve. This problem is commonly referred to as model degradation.
To understand what is happening inside an ML model and why its performance decreases over time, we must understand one simple concept: after a data scientist gathers the necessary data, trains the model, tunes its parameters, and evaluates it, the final product is a simplified representation of a real-life system or process, captured at a specific moment in time.
A model can be viewed as a snapshot of a real-life system or process at a specific period of time. Models are static.
ML models, being static in nature, have one large universal flaw: they are trained to recognize patterns in the data they have seen before, and they behave unpredictably when the data shifts or contains anomalies.
This is why it's critically important to build a system around the model that keeps track of the input data, what the model is predicting, and how accurately it's doing so. Better yet, the system should include alerts that proactively notify ML practitioners or data scientists that something has changed, prompting action.
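As a concrete illustration, here is a minimal sketch of such a tracking-and-alerting component. The class name, window size, and accuracy threshold are illustrative assumptions, not a prescribed design; a production system would also persist the inputs and predictions it logs.

```python
from collections import deque

class ModelMonitor:
    """Tracks a rolling window of prediction outcomes and flags degradation."""

    def __init__(self, window=500, min_accuracy=0.9):
        # each entry is True if the prediction matched ground truth
        self.outcomes = deque(maxlen=window)
        self.min_accuracy = min_accuracy

    def log(self, prediction, actual):
        self.outcomes.append(prediction == actual)

    @property
    def accuracy(self):
        return sum(self.outcomes) / len(self.outcomes) if self.outcomes else None

    def should_alert(self):
        # alert only once the window is full, to avoid noisy early warnings
        return (len(self.outcomes) == self.outcomes.maxlen
                and self.accuracy < self.min_accuracy)
```

An alert from `should_alert` would then feed whatever notification channel the team uses, such as a dashboard or pager.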
There are many reasons why input data can change. A few examples: seasonal change (there is more demand for AC units in summer than in other seasons), a spike of interest in a product, demographic shifts, and changing inflation rates.
When model performance decreases without significant changes in the input data or the model's outputs (unlike the contextual examples above), the phenomenon is called concept drift.
Concept drift is a decline in model performance over time caused by changes in the outside world that are not necessarily reflected in the input data. These changes could be economic (the inflation rate changed), political (new regulations came into force), or even ecological (climate change forced automakers to transition from combustion engines to electric).
To determine whether the model is still making correct predictions, data scientists use statistical tests or simple rule-based mechanisms to decide whether the model has to be retrained. In essence, we measure and weigh how much prediction error we can tolerate before taking action.
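A rule-based trigger can be as simple as an error budget. The function below is a minimal sketch; the 5% default tolerance is an illustrative assumption that each team would set from its own business needs.

```python
def needs_retraining(baseline_error, current_error, tolerance=0.05):
    """Flag retraining when error has grown beyond the tolerated budget.

    baseline_error: error rate measured at deployment time
    current_error:  error rate observed in production
    tolerance:      absolute increase in error we are willing to accept
    """
    return (current_error - baseline_error) > tolerance
```

For example, `needs_retraining(0.10, 0.12)` stays within budget, while `needs_retraining(0.10, 0.18)` signals that it is time to retrain.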
Measuring model performance works well when ground truth is available; unfortunately, the real world rarely grants us this luxury. Take the financial services industry, where ML models are used to predict whether clients will default on a loan. Months or years can pass before ground truth is available to assess model performance.
If we can’t have ground truth right away, what options do we have?
We might not always know whether model performance has decreased, but we can measure how the underlying distributions of the input data and the model's predictions have shifted relative to the training data and past predictions. These shifts are tightly connected to model performance and are identified as feature drift and model drift.
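One widely used way to quantify such a shift is the Population Stability Index (PSI), which compares the binned distribution of recent production data against the training data. Below is a minimal sketch; the bin count and the common rule-of-thumb thresholds (below 0.1 stable, above 0.25 significant shift) are conventions, not hard rules.

```python
import numpy as np

def psi(expected, actual, n_bins=10):
    """Population Stability Index between a training sample and a recent sample.

    Assumes a continuous feature, so the percentile bin edges are distinct.
    """
    # bin edges come from the expected (training) distribution
    edges = np.percentile(expected, np.linspace(0, 100, n_bins + 1))
    edges[0], edges[-1] = -np.inf, np.inf
    e_frac = np.histogram(expected, bins=edges)[0] / len(expected)
    a_frac = np.histogram(actual, bins=edges)[0] / len(actual)
    # floor the fractions to avoid log(0) on empty bins
    e_frac = np.clip(e_frac, 1e-6, None)
    a_frac = np.clip(a_frac, 1e-6, None)
    return float(np.sum((a_frac - e_frac) * np.log(a_frac / e_frac)))
```

The same function can be applied to a single feature or to the model's prediction scores, which makes it useful when ground truth is still months away.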
Machine learning models can have a large number of features. Combined, these features form the model input, and each of them has an individual impact on the model outcome. For example, if a model predicts the risk of hospitalization from COVID-19, age has more impact on the outcome than gender. Age is an example of a feature in this case.
It's good practice to evaluate each feature for drift, and several reliable non-parametric statistical tests exist to do the job. These tests make no assumptions about your data and can be applied to both features and the model outcome.
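One such non-parametric test is the two-sample Kolmogorov–Smirnov test, whose statistic is the largest gap between the empirical CDFs of the training sample and the production sample. Here is a minimal sketch that computes the statistic only; libraries such as SciPy also provide the associated p-value.

```python
import numpy as np

def ks_statistic(sample_a, sample_b):
    """Two-sample Kolmogorov-Smirnov statistic: max distance between ECDFs."""
    a, b = np.sort(sample_a), np.sort(sample_b)
    values = np.concatenate([a, b])
    # empirical CDF of each sample evaluated at every observed value
    cdf_a = np.searchsorted(a, values, side="right") / len(a)
    cdf_b = np.searchsorted(b, values, side="right") / len(b)
    return float(np.max(np.abs(cdf_a - cdf_b)))
```

A statistic near 0 means the two samples look alike; near 1 means they barely overlap. Where to set the alerting threshold (or whether to use the p-value instead) is a per-feature judgment call.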
Feature and model drift appear in different forms, but we can formally break them down into categories such as sudden, gradual, incremental, and recurring drift.
Data scientists should be able to distinguish between these types in order to make appropriate decisions about model retraining and feature engineering based on their analysis and business needs.
There are several approaches to managing drift, and one of them is simply to ignore it. In rare cases, some features do not contribute significantly to the model outcome, and if the goal is to maintain model performance, retraining the entire model for minimal gains can be avoided.
However, data scientists commonly exclude such features from the model before production. Another logical approach is to retrain the model, which presents several choices that must be weighed by the data scientist: how often to retrain, how much historical data to include, and whether to retrain offline or adopt online learning methods.
Some companies retrain models on a monthly basis without checking whether the data has drifted or model performance has decreased. This approach creates a false sense of security, because there is no guarantee that model performance has changed for the better, or for the worse, before ground truth is available.
Implementing a model monitoring system is crucial for models that make consequential decisions, whether measured in dollars or in human terms. The nominal direct cost is offset in two ways: data scientists are freed to do their best modeling work instead of checking in on deployed models, and degraded models are caught early. Catching a single model that delivers a million dollars per year in value drifting by 5% could mean $50,000 of savings or preserved profit. For large companies with hundreds or thousands of models making millions of decisions per year, the risk surface is uncomfortably large, and some models are certainly not performing optimally.
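That arithmetic generalizes into a simple back-of-the-envelope estimate; the figures are illustrative, not benchmarks.

```python
def value_at_risk(annual_value_usd, performance_drop):
    """Dollars of value exposed when a model's effectiveness drops by a fraction."""
    return annual_value_usd * performance_drop
```

For the million-dollar model above, `value_at_risk(1_000_000, 0.05)` reproduces the $50,000 figure; multiplied across a large model portfolio, the exposure adds up quickly.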
Maintaining the quality of production machine learning models adds minimal business overhead, and the benefits greatly outweigh the costs. Doing so builds trust and a common understanding of the quality and performance of AI systems across business and technical stakeholders. It minimizes the risk of catastrophic errors, creates an environment where data scientists can focus on doing their best work, and builds business leaders' confidence in their models' performance.
Do you trust your models are performing as intended? And if not, what are you doing about it?
Arseny Turin is an Integration Data Scientist at Monitaur
Notes and sources
 Bayram, Ahmed, Kassler, “From Concept Drift to Model Degradation: An Overview on Performance-Aware Drift Detectors”, arXiv. March 21, 2022.
 Online learning methods are designed to continuously retrain models on a stream of incoming data, which makes them resilient to data drift. These methods require ground truth to be available.
 In industries such as insurance, banking, or healthcare, decisions made by models cannot be evaluated immediately because it takes time for the ground truth to become available.