In a recent podcast of the AI Fundamentalists, we spoke about information theory and why, although it is a very valuable discipline, its *divergences* are often the wrong choice for model and data drift monitoring. In this post, we summarize the goals of information theory, define the differences between *metrics* and *divergences,* explain *why* divergences are the wrong choice for monitoring, and propose better alternatives.

Information theory as we know it today came out of Claude Shannon's work in the 1940s at Bell Labs ^{[1]}. It uses probability and statistics to study how information is transferred and used, quantified in bits. Information theory was instrumental in cryptography, messaging, and related fields because it focuses on how many bits of information are minimally needed to convey a given idea. One of the main driving concepts in information theory is entropy, which quantifies the amount of uncertainty involved in a data process: the less variability within a given process, the less "information" is conveyed, and the lower the entropy.
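To make entropy concrete, here is a minimal sketch in Python (the `shannon_entropy` helper is ours for illustration, not a standard API): a fair coin is maximally uncertain and carries one full bit, while a biased coin carries less.

```python
import math

def shannon_entropy(probs):
    """Shannon entropy in bits: H(X) = -sum p(x) * log2 p(x)."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# A fair coin carries a full bit of uncertainty; a biased coin carries less.
fair = shannon_entropy([0.5, 0.5])    # 1.0 bit
biased = shannon_entropy([0.9, 0.1])  # ~0.469 bits
```

Less variability (the biased coin) means less uncertainty, and therefore lower entropy.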

Outside of entropy-based input selection, information theory relates to modeling systems chiefly through input and output data monitoring. To understand why information theory methods might not be the best choice for monitoring, we first need to understand the difference between a distance metric and a divergence, and why it matters.

Without diving too deep down a mathematical rabbit hole, it is important to understand the concept of *vectors* and *spaces.* A vector can be thought of as a 'column' of values, often notated as $x^\prime = [x_1, x_2, ..., x_n]$. A single vector, $x^\prime$, would be considered a single *point* in a *space* ^{[2]}. The most commonly used *spaces* have 2-3 *dimensions* and are typically referred to as Euclidean, in honor of the ancient (approximately 300 BC) Greek mathematician Euclid. Euclidean geometry gives us familiar properties, such as the shortest path between two points being a straight line.

More simply, this means the Euclidean distance\* between your house and Starbucks is the straight-line path, even if it cuts through the globe, not the best sequence of roads and tunnels to get there.

We can calculate the *distance* between two *points* in a *space,* referred to as a *metric space,* if four distance properties are met.\*\*

**Positivity**: distances must never be negative: $d(x,y) \geq 0\ \forall\ x,y$

**Identity**: the distance is zero if and only if points $x$ and $y$ are the same: $d(x,y) = 0\ \text{iff}\ x = y$

**Symmetry**: the distance from $x \to y$ is the same as the distance from $y \to x$: $d(x,y) = d(y,x)\ \forall\ x,y$

**Triangle inequality**: you can arrive at $y$ by detouring through $z$, but the detour will never be shorter than the most direct path between $x$ and $y$: $d(x,y) \leq d(x,z) + d(z,y)\ \forall\ x,y,z$
- This is the property most often violated by the divergences discussed below.
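These four properties can be checked directly. The sketch below verifies all of them for ordinary Euclidean distance on a few hand-picked points (the `euclidean` helper is illustrative, not a library function):

```python
import math

def euclidean(x, y):
    """Straight-line (Euclidean) distance between two points."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

x, y, z = (0.0, 0.0), (3.0, 4.0), (6.0, 0.0)

assert euclidean(x, y) >= 0                                   # positivity
assert euclidean(x, x) == 0                                   # identity
assert euclidean(x, y) == euclidean(y, x)                     # symmetry
assert euclidean(x, y) <= euclidean(x, z) + euclidean(z, y)   # triangle inequality
```

Euclidean distance passes all four checks, which is what earns it the title of *metric*.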

If any of these properties are not met, the *points* are not in a *metric space* and true *distances* cannot be calculated. This may seem trivial, but metric spaces are what make chess AIs and GPS driving instructions possible. In the mathematical sphere, failing these properties carries ramifications, one of which is losing access to the standard machinery of p-values and alpha levels. Universally, with a properly configured test on a distance metric, we can use alpha and p-value to determine whether a change in *distance* between two *points* in a *space* is statistically significant. Without this, we must rely on divergence-specific heuristics to determine whether a result is adverse. In responsibly deploying modeling systems, knowing when distributions have shifted in a statistically significant way, especially across many inputs and modeling systems, is critical.

Now that we’ve walked through what a distance metric is and briefly discussed why it matters, let’s get back to exploring commonly used information theory divergences. Consider a divergence to essentially be a measure of dissimilarity between two probability distributions: it behaves locally enough like a distance to perform calculus on ^{[8]}, but it does not meet all of the properties required of a metric space.

- Kullback-Leibler divergence (KL) measures the ‘surprise’ in the difference between two probability distributions ^{[5]}.
  - Discrete equation: $D_\text{KL}(P \parallel Q) = \sum_{x\in\mathcal{X}} P(x) \log\left(\frac{P(x)}{Q(x)}\right)$
  - It is not a metric:
    - It is asymmetric, meaning $D_\text{KL}(P \parallel Q) \neq D_\text{KL}(Q \parallel P)$
    - It does not satisfy the triangle inequality property
  - It can return an undefined (infinite) result whenever $Q(x) = 0$ for some $x$ with $P(x) > 0$
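A short sketch makes these failures tangible. The `kl_divergence` helper below is our own illustrative implementation of the discrete equation, using the usual conventions for zero probabilities:

```python
import math

def kl_divergence(p, q):
    """Discrete KL divergence D_KL(P || Q) in nats."""
    total = 0.0
    for pi, qi in zip(p, q):
        if pi == 0:
            continue            # 0 * log(0 / q) is taken as 0 by convention
        if qi == 0:
            return math.inf     # P puts mass where Q has none: undefined (infinite)
        total += pi * math.log(pi / qi)
    return total

p = [0.1, 0.4, 0.5]
q = [0.8, 0.15, 0.05]

forward = kl_divergence(p, q)   # does not equal...
reverse = kl_divergence(q, p)   # ...the other direction: asymmetric
blow_up = kl_divergence([0.5, 0.5, 0.0], [1.0, 0.0, 0.0])  # math.inf
```

Both failure modes matter in monitoring: asymmetry means the answer depends on which sample you call "baseline," and the infinite case can be triggered by a single empty bin in production data.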

- Jensen-Shannon divergence (JS) is an extension of the Kullback-Leibler divergence ^{[6]}.
  - It fixes the asymmetry and undefined-result issues of KL divergence: it is symmetric and always finite.
  - Unlike Kullback-Leibler, it remains well defined even when one distribution assigns zero probability where the other does not.
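The fix is easy to see in code: JS divergence averages the KL divergence of each distribution against their mixture, so no term ever divides by zero. The helpers below are our own illustrative implementations, not library calls:

```python
import math

def kl(p, q):
    """Discrete KL divergence in bits (mixture q is never zero where p > 0)."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js_divergence(p, q):
    """JS divergence in bits: average KL of each distribution to the mixture M."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# The same pair that made KL divergence blow up:
p = [0.5, 0.5, 0.0]
q = [1.0, 0.0, 0.0]

forward = js_divergence(p, q)   # ~0.311 bits, finite
reverse = js_divergence(q, p)   # same value: symmetric
```

Even so, symmetry and boundedness alone do not make JS a metric space suitable for standard significance testing.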

- Population Stability Index (PSI) was first published by Karakoulas in 2004 ^{[7]} and has since been used in model risk management for quantifying changes in model distributions.
  - It is calculated by specifying bins and computing the percentage of data from each distribution that falls within each bin.
  - Although symmetric, PSI has the drawbacks of returning an undefined result when a bin is empty, requiring user-specified bins, and lacking a measurable level of criticality.
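A minimal sketch of the calculation, assuming hypothetical bin edges and samples (the `psi` helper and its conventions are ours, not a standard API):

```python
import math

def psi(expected, actual, bin_edges):
    """Population Stability Index over user-specified bins."""
    def fractions(data):
        counts = [0] * (len(bin_edges) - 1)
        for v in data:
            for i in range(len(counts)):
                if bin_edges[i] <= v < bin_edges[i + 1]:
                    counts[i] += 1
                    break
        return [c / len(data) for c in counts]

    e, a = fractions(expected), fractions(actual)
    total = 0.0
    for ei, ai in zip(e, a):
        if ei == 0 or ai == 0:
            return math.inf   # empty bin: PSI is undefined (infinite)
        total += (ai - ei) * math.log(ai / ei)
    return total

edges = [0.0, 0.25, 0.5, 0.75, 1.0]       # user-specified bins: a key drawback
baseline = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9]
current = [0.1, 0.3, 0.3, 0.6, 0.6, 0.7, 0.8, 0.9]

score = psi(baseline, current, edges)      # finite, but no p-value to judge it by
```

Note that the resulting score has no attached significance level: practitioners fall back on rules of thumb (such as flagging values above 0.1 or 0.25) rather than a statistical test.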

So why do divergences dominate monitoring tools? We don’t definitively know. Information theory has always been there and is the backbone of what makes up computer science. Most Machine Learning and MLOps engineers have computer science backgrounds and may not be deeply familiar with statistics, so they may default to the information theory hammer in their statistical toolbox. When you have a hammer, everything is a nail.

Setting aside the theoretical downsides, Evidently AI has an excellent post ^{[3]} comparing multiple methods. Notably, their empirical findings show that Kullback-Leibler divergence, Jensen-Shannon divergence, and Population Stability Index are ineffective tools for detecting drift.

As an alternative, we propose that non-parametric metrics are ‘the way’. The main exception to this is if you have a strong understanding of your underlying statistical distributions, in which case parametric statistics give you the highest level of performance. We refer you to our previous podcast and post on non-parametric statistics for more details ^{[4]}.
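As one concrete non-parametric option, the two-sample Kolmogorov-Smirnov test compares a baseline sample against a production sample and returns a p-value that can be judged against a chosen alpha, exactly the machinery divergences lack. The data below are simulated purely for illustration:

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(seed=42)
baseline = rng.normal(loc=0.0, scale=1.0, size=5_000)  # training-time sample
drifted = rng.normal(loc=0.5, scale=1.0, size=5_000)   # shifted production sample

result = ks_2samp(baseline, drifted)
alpha = 0.05                          # significance level chosen by the practitioner
drift_detected = result.pvalue < alpha
```

Because the test is distribution-free, the same alpha threshold applies uniformly across many inputs and modeling systems, with no per-feature heuristics required.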

In this post we've provided a follow-up discussion from our podcast on information theory. If this content was helpful, if you would like us to go deeper in an area, or if you have questions or requests, please submit feedback on our site.

Until next time,

Andrew & Sid, The AI Fundamentalists.

* Note: Many of the underpinnings for modeling and computational systems hinge on this concept of Euclidean distance.

** Notation review:

- $x, y, z$ are 3-dimensional *vectors* in Euclidean *space.*
- $d(x,y)$ is the distance between two *real-valued* vectors $x$ and $y$.
- Real values, often notated as $\mathbb{R}$, are any numbers that could theoretically be found on a number line.
- $\forall$ means ‘for all possible’.

[1]: C. E. Shannon. *A Mathematical Theory of Communication*. Bell System Technical Journal 1948-07: Volume 27, Issue 3: AT&T Bell Laboratories. 1948.

[2]: Chiang, Alpha C., and Kevin Wainwright. *Fundamental Methods of Mathematical Economics*. 4th ed. Boston, Mass: McGraw-Hill/Irwin, 2005.

[3]: “Which Test Is the Best? We Compared 5 Methods to Detect Data Drift on Large Datasets.” Accessed March 22, 2024. https://www.evidentlyai.com/blog/data-drift-detection-large-datasets.

[4]: Exploring non-parametric statistics. Accessed March 22, 2024. https://www.monitaur.ai/blog-posts/exploring-non-parametric-statistics

[5]: Kullback, S., and R. A. Leibler. “On Information and Sufficiency.” *The Annals of Mathematical Statistics* 22, no. 1 (March 1951): 79–86. https://doi.org/10.1214/aoms/1177729694.

[6]: Lin, J. “Divergence Measures Based on the Shannon Entropy.” *IEEE Transactions on Information Theory* 37, no. 1 (January 1991): 145–51. https://doi.org/10.1109/18.61115.

[7]: Karakoulas, Grigoris. “Empirical Validation of Retail Credit-Scoring Models.” *RMA Journal 87,* (September 2004): 56-60. https://cms.rmau.org/uploadedFiles/Credit_Risk/Library/RMA_Journal/Other_Topics_(1998_to_present)/Empirical Validation of Retail Credit-Scoring Models.pdf

[8] David Guichard. “Divergence and Curl.” *Calculus: early transcendentals;* Chapter 16.5. Whitman College. https://www.whitman.edu/mathematics/calculus_online/section16.05.html