Information theory: Not the best approach for model monitoring

In a recent episode of the AI Fundamentalists podcast, we discussed information theory and why, although it is a very valuable discipline, its divergences are often the wrong choice for model and data drift monitoring. In this post, we summarize the goals of information theory, explain the difference between metrics and divergences, show why divergences are the wrong choice for monitoring, and propose better alternatives.

What is information theory?

Information theory as we know it today came out of Claude Shannon's work at Bell Labs in the 1940s ^[1]. It builds on probability and statistics to study how information, quantified in bits, is transferred and used. Information theory was instrumental in cryptography, messaging, and related fields because it focuses on the minimum number of bits needed to convey a given message. One of its central concepts is entropy, which quantifies the amount of uncertainty in the outcome of a data process. The less variability within a given process, the less “information” is conveyed, yielding a lower entropy.
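
To make the link between variability and entropy concrete, here is a minimal sketch of Shannon entropy for a discrete distribution, measured in bits. The function name and the coin examples are our own illustrations, not part of any particular library.

```python
import math

def shannon_entropy(probs):
    """Shannon entropy, in bits, of a discrete distribution given as probabilities."""
    # Outcomes with probability 0 contribute nothing, so they are skipped.
    return sum(p * math.log2(1.0 / p) for p in probs if p > 0)

fair_coin = [0.5, 0.5]    # maximum uncertainty for two outcomes
biased_coin = [0.9, 0.1]  # less variability, so less information per flip
certain = [1.0, 0.0]      # no variability at all

print(shannon_entropy(fair_coin))    # 1.0
print(shannon_entropy(biased_coin))  # about 0.469
print(shannon_entropy(certain))      # 0.0
```

The fair coin needs a full bit per flip, while the process that always produces the same outcome conveys no information.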

Outside of entropy-based input selection, information theory relates to modeling systems mainly through input and output data monitoring. To understand why information theory methods might not be the best for monitoring, we first need to understand the difference between a distance metric and a divergence, and why it matters.

What is a distance metric?

Without diving too deep down a mathematical rabbit hole, it is important to understand the concept of vectors and spaces. A vector can be thought of as a ‘column’ of values, often notated as: $x' = [x_{1}, x_{2}, \ldots, x_{n}]$. A single vector, $x'$, would be considered a single point in a space ^[2]. The most commonly used spaces have two or three dimensions and are typically referred to as Euclidean, in honor of the ancient Greek mathematician Euclid (approximately 300 BC). Euclidean geometry gives us familiar properties, such as the shortest path between two points being a straight line.*

More simply, the Euclidean distance between your house and Starbucks is the length of the straight line between them, even if that line passes through the globe; it is not the length of the best sequence of roads and tunnels to get there.

We can calculate the distance between any two points in a space, and the space is referred to as a metric space, if four distance properties are met.

Properties of a distance metric**

Non-negativity: distances are never negative
- $d(x, y) \geq 0 \; \forall x, y$
Identity: the distance is zero if and only if points x and y are the same
- $d(x, y) = 0 \iff x = y$
Symmetry: the distance from x → y is the same as the distance from y → x
- $d(x, y) = d(y, x) \; \forall x, y$
Triangle inequality: you can arrive at y by detouring through z, but this will not make your journey shorter than the shortest path between x and y.
- $d(x, y) \leq d(x, z) + d(z, y) \; \forall x, y, z$
- In Euclidean space, equality holds exactly when z lies on the straight line segment between x and y
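
As a rough illustration of the four properties above, the sketch below checks them numerically on a handful of sample points. The helper `check_metric_properties` and the sample points are hypothetical names and values chosen for this example. Squared Euclidean distance is included as a contrast we picked ourselves: it looks like a distance but breaks the triangle inequality.

```python
import itertools
import math

def euclidean(x, y):
    """Straight-line distance between two points given as equal-length tuples."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def squared_euclidean(x, y):
    """Squared straight-line distance; it resembles a distance but is not a metric."""
    return sum((a - b) ** 2 for a, b in zip(x, y))

def check_metric_properties(d, points, tol=1e-12):
    """Test the four metric properties on every combination of the sample points.

    A False is a genuine counterexample. A True only means that no violation
    was found among these particular points.
    """
    ok = {"non_negative": True, "identity": True, "symmetry": True, "triangle": True}
    for x, y in itertools.product(points, repeat=2):
        if d(x, y) < -tol:
            ok["non_negative"] = False
        if (d(x, y) <= tol) != (x == y):
            ok["identity"] = False
        if abs(d(x, y) - d(y, x)) > tol:
            ok["symmetry"] = False
    for x, y, z in itertools.product(points, repeat=3):
        if d(x, y) > d(x, z) + d(z, y) + tol:
            ok["triangle"] = False
    return ok

points = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (3.0, 4.0)]
print(check_metric_properties(euclidean, points))          # all four True
print(check_metric_properties(squared_euclidean, points))  # triangle is False
```

For the squared distance, going from (0, 0) to (2, 0) "costs" 4, while detouring through (1, 0) costs only 1 + 1 = 2, so the detour is shorter than the direct path and the triangle inequality fails.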

If any of these properties is not met, the function is not a distance metric and the points are not in a metric space. This may seem trivial, but metric spaces are what make chess AIs and GPS driving instructions possible. In the mathematical sphere, failing these properties carries ramifications, one of which is losing the ability to use the standard suite of p-values and significance levels (alpha). With a properly configured test on a distance metric, we can use alpha and the p-value to determine whether a change in distance between two points in space is statistically significant. Without this, we must rely on divergence-specific heuristics to determine whether a result is adverse. When deploying modeling systems responsibly, it is critical to know when distributions have shifted to a statistically significant degree, particularly across many inputs and modeling systems.

What are divergences?

Now that we’ve walked through what a distance metric is and briefly discussed why it matters, let’s get back to exploring commonly used information theory divergences. Consider a divergence to essentially be a measure of dissimilarity between two points (here, probability distributions) in a space ^[8] that is locally similar enough to Euclidean space to perform calculus on, but that does not meet all of the properties required of a distance metric.

Divergences derived from information theory

Kullback-Leibler Divergence (KL Divergence)^[5]

Kullback-Leibler divergence (KL) measures the ‘surprise’ of observing one probability distribution, P, when another, Q, was expected.
- Discrete equation:
  - $D_{KL}(P \parallel Q) = -\sum_{x \in X} P(x) \log\left(\frac{Q(x)}{P(x)}\right)$
It is not a metric:
- It is asymmetric, meaning that in general $D_{KL}(P \parallel Q) \neq D_{KL}(Q \parallel P)$
- It does not satisfy the triangle inequality property
- It can return an undefined or infinite result: if $Q(x) = 0$ for some $x$ where $P(x) > 0$, the corresponding term is unbounded
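
The three failure modes listed above can be reproduced with a few lines of code. This is a minimal sketch of the discrete equation, assuming natural logarithms (the equation above does not fix a base) and three small two-outcome distributions that we made up for illustration.

```python
import math

def kl_divergence(p, q):
    """D_KL(P || Q) in nats for discrete distributions over the same outcomes."""
    total = 0.0
    for p_x, q_x in zip(p, q):
        if p_x == 0:
            continue  # by convention, outcomes P never produces contribute 0
        if q_x == 0:
            return math.inf  # P puts mass where Q has none
        total += p_x * math.log(p_x / q_x)
    return total

# Hypothetical two-outcome distributions chosen for illustration.
p = [0.5, 0.5]
q = [0.25, 0.75]
r = [0.1, 0.9]

# Asymmetry: swapping the arguments changes the answer.
print(kl_divergence(p, q))  # about 0.144
print(kl_divergence(q, p))  # about 0.131

# Triangle inequality violation: the direct "distance" from P to R
# is larger than the detour through Q.
print(kl_divergence(p, r))                        # about 0.511
print(kl_divergence(p, q) + kl_divergence(q, r))  # about 0.236

# Infinite result when Q assigns zero probability to something P produces.
print(kl_divergence(p, [1.0, 0.0]))  # inf
```

In a monitoring setting, the last case is easy to hit: a single category or bin that appears in production data but not in the reference data is enough to make the result infinite.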

Jensen-Shannon Divergence^[6]

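Jensen–Shannon divergence is an extension of the Kullback-Leibler divergence that compares each distribution to the average of the two.
It fixes the asymmetry and the undefined-result issues of KL divergence.
Jensen-Shannon is always finite, even when one distribution assigns zero probability to an outcome, unlike Kullback-Leibler.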
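
For reference, the usual definition is $D_{JS}(P \parallel Q) = \frac{1}{2} D_{KL}(P \parallel M) + \frac{1}{2} D_{KL}(Q \parallel M)$ with $M = \frac{1}{2}(P + Q)$. The sketch below implements that definition with natural logarithms (our assumption) and uses two made-up distributions with no overlap at all, the case where KL divergence is infinite.

```python
import math

def kl_divergence(p, q):
    """D_KL(P || Q) in nats; here q is always a mixture, so q_x > 0 wherever p_x > 0."""
    return sum(p_x * math.log(p_x / q_x) for p_x, q_x in zip(p, q) if p_x > 0)

def js_divergence(p, q):
    """Jensen-Shannon divergence in nats between two discrete distributions."""
    m = [0.5 * (p_x + q_x) for p_x, q_x in zip(p, q)]  # the average distribution
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)

# Hypothetical distributions with no overlapping outcomes.
p = [1.0, 0.0]
q = [0.0, 1.0]

print(js_divergence(p, q))  # about 0.693, which is log(2), the largest possible value
print(js_divergence(q, p))  # same value: the order of the arguments does not matter
print(js_divergence(p, p))  # 0.0 for identical distributions
```

Because each distribution is only ever compared to the average, the denominator inside the logarithm can never be zero where the numerator is positive, which is what keeps the result finite.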

Population Stability Index (PSI)^[7]

Population Stability Index was first published by Karakoulas in 2004 ^[7] and has since been used in model risk management to quantify changes in the distributions of model inputs and scores.
It is calculated by dividing the data into specified bins, calculating the percentage of each distribution that falls within each bin, and comparing those percentages bin by bin.
Although symmetric, PSI has the drawbacks of being undefined when a bin is empty in either distribution, requiring user-specified bins, and lacking a statistically grounded threshold for when a change is critical.
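
The sketch below uses the commonly cited form $PSI = \sum_i (a_i - e_i) \ln(a_i / e_i)$, where $e_i$ and $a_i$ are the fractions of the reference and current data in bin $i$. The helper names, bin edges, and score values are all hypothetical and chosen only to show the drawbacks described above.

```python
import math

def bin_proportions(values, edges):
    """Fraction of values in each bin [edges[i], edges[i+1]); the last bin includes its upper edge."""
    counts = [0] * (len(edges) - 1)
    for v in values:
        for i in range(len(counts)):
            is_last = i == len(counts) - 1
            if edges[i] <= v < edges[i + 1] or (is_last and v == edges[-1]):
                counts[i] += 1
                break
    return [c / len(values) for c in counts]

def psi(expected, actual, edges):
    """Population Stability Index; returns NaN if any bin is empty in either sample."""
    e = bin_proportions(expected, edges)
    a = bin_proportions(actual, edges)
    if any(x == 0 for x in e + a):
        return math.nan  # log(0) or division by zero: PSI is undefined
    return sum((a_i - e_i) * math.log(a_i / e_i) for e_i, a_i in zip(e, a))

# Hypothetical model scores, all within the range 0 to 100.
baseline = [10, 20, 30, 40, 55, 60, 70, 80, 90, 95]
current = [5, 30, 35, 45, 55, 60, 65, 80, 85, 90]
shifted = [30, 35, 40, 45, 55, 60, 65, 80, 85, 90]

print(psi(baseline, current, [0, 25, 50, 75, 100]))  # about 0.110
print(psi(baseline, current, [0, 50, 100]))          # 0.0 with coarser bins
print(psi(baseline, shifted, [0, 25, 50, 75, 100]))  # nan: the lowest bin is empty
```

The same pair of samples produces about 0.110 with four bins and exactly 0 with two, so the answer depends on a choice the user has to make up front, and an empty bin leaves the index undefined.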

Why has information theory gained a following in model risk management and MLOps?

We don’t definitively know. It has always been there and is part of the backbone of computer science. Most machine learning and MLOps engineers have computer science backgrounds and may not be deeply familiar with statistics, so they may default to the information theory hammer in their statistical toolbox. When all you have is a hammer, everything looks like a nail.

Setting aside the theoretical downsides, Evidently AI has an excellent post ^[3] comparing multiple methods. This work notably highlights the empirical finding that Kullback-Leibler Divergence, Jensen-Shannon Divergence, and Population Stability Index are ineffective tools for detecting drift.

As an alternative, we propose that non-parametric metrics are ‘the way’. The main exception to this is if you have a strong understanding of your underlying statistical distributions, in which case parametric statistics give you the highest level of performance. We refer you to our previous podcast and post on non-parametric statistics for more details ^[4].
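
As one example of what a non-parametric approach can look like, the sketch below implements the two-sample Kolmogorov-Smirnov test. The choice of this particular test is our assumption for illustration, and the p-value uses the standard large-sample approximation, so treat it as a sketch. In practice, `scipy.stats.ks_2samp` provides a maintained implementation. The samples are made-up, evenly spaced scores so the results are reproducible.

```python
import math

def ks_statistic(sample_a, sample_b):
    """Largest vertical gap between the two empirical cumulative distributions."""
    a, b = sorted(sample_a), sorted(sample_b)
    i = j = 0
    d = 0.0
    while i < len(a) and j < len(b):
        x = min(a[i], b[j])
        # Step both empirical CDFs past every observation equal to x.
        while i < len(a) and a[i] <= x:
            i += 1
        while j < len(b) and b[j] <= x:
            j += 1
        d = max(d, abs(i / len(a) - j / len(b)))
    return d

def ks_p_value(d, n, m, terms=100):
    """Large-sample approximation to the two-sided p-value for KS statistic d."""
    lam = math.sqrt(n * m / (n + m)) * d
    if lam < 0.3:
        return 1.0  # the p-value is essentially 1 for gaps this small
    series = sum((-1) ** (k - 1) * math.exp(-2.0 * (k * lam) ** 2)
                 for k in range(1, terms + 1))
    return max(0.0, min(1.0, 2.0 * series))

# Hypothetical reference scores and two candidate production samples.
baseline = list(range(100))
similar = [x + 0.5 for x in baseline]  # nearly the same distribution
shifted = [x + 30 for x in baseline]   # clearly shifted distribution

alpha = 0.05  # significance level chosen before looking at the data
for name, sample in [("similar", similar), ("shifted", shifted)]:
    d = ks_statistic(baseline, sample)
    p = ks_p_value(d, len(baseline), len(sample))
    print(name, round(d, 3), p, "drift" if p < alpha else "no drift detected")
```

Unlike the divergence thresholds above, the decision rule here is the familiar comparison of a p-value against alpha, and it requires neither bins nor an assumption about the shape of the underlying distribution.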

Conclusion

In this post, we've provided a follow-up discussion to our podcast episode on information theory. If this content was helpful, if you would like us to go deeper in an area, or if you have questions or requests, please submit feedback on our site.

Until next time,

Andrew & Sid, The AI Fundamentalists.

Notes

* Note: Many of the underpinnings for modeling and computational systems hinge on this concept of Euclidean distance.

** Notation review:

$x, y, z$ are 3-dimensional vectors in Euclidean space.
$d(x, y)$ is the distance between two real-valued vectors $x$ and $y$
- Real values, often notated as $\mathbb{R}$, are any numbers that could theoretically be found on a number line
$\forall$ means ‘for all possible’

References

[1]: Shannon, C. E. “A Mathematical Theory of Communication.” The Bell System Technical Journal 27, no. 3 (July 1948): 379–423.

[2]: Chiang, Alpha C., and Kevin Wainwright. Fundamental Methods of Mathematical Economics. 4th ed. Boston, Mass: McGraw-Hill/Irwin, 2005.

[3]: “Which Test Is the Best? We Compared 5 Methods to Detect Data Drift on Large Datasets.” Accessed March 22, 2024. https://www.evidentlyai.com/blog/data-drift-detection-large-datasets.

[4]: Exploring non-parametric statistics. Accessed March 22, 2024. https://www.monitaur.ai/blog-posts/exploring-non-parametric-statistics

[5]: Kullback, S., and R. A. Leibler. “On Information and Sufficiency.” The Annals of Mathematical Statistics 22, no. 1 (March 1951): 79–86. https://doi.org/10.1214/aoms/1177729694.

[6]: Lin, J. “Divergence Measures Based on the Shannon Entropy.” IEEE Transactions on Information Theory 37, no. 1 (January 1991): 145–51. https://doi.org/10.1109/18.61115.

[7]: Karakoulas, Grigoris. “Empirical Validation of Retail Credit-Scoring Models.” RMA Journal 87 (September 2004): 56–60. https://cms.rmau.org/uploadedFiles/Credit_Risk/Library/RMA_Journal/Other_Topics_(1998_to_present)/Empirical Validation of Retail Credit-Scoring Models.pdf

[8]: Guichard, David. “Divergence and Curl.” Calculus: Early Transcendentals, Chapter 16.5. Whitman College. https://www.whitman.edu/mathematics/calculus_online/section16.05.html