In our previous posts, *Breaking down bias in AI* and *How does bias happen, technically?*, we highlighted that things are complex and discussed the three main causes of automated decision bias. The purpose of this post is to walk through our three favorite metrics for *detecting* that bias exists for a model in production.

How to measure bias can be a moving target. In this blog post, we will examine the common methods for evaluating bias, how they conflict, and our recommended approaches.

There are several excellent survey papers on the various bias metrics that we recommend. Our favorite is titled “The Zoo of Fairness Metrics in Machine Learning” and was authored by Castelnovo et al. [1]. Castelnovo et al. group the many bias metrics into three broad categories: Group Fairness, Individual Fairness, and Causality-based Fairness.

As we discussed in *Breaking down bias in AI*, there isn’t one right or wrong way to train an unbiased model, so the approach you take and metrics you use will change depending on context. The short answer for which approach or metric is best is always frustrating: “it depends.” For today’s discussion, we will dig into our personal three favorites:

- Disparate Impact
- Equalized Odds
- Non-parametric cohort analysis

To begin digging into details, we first need to get some notation and terminology out of the way. For this post, we will use the following notation:

- $A$ is the categorical attribute representing the protected attribute
- $X$ is all other non-protected features
- ${\hat Y = F(X,A) \in (0,1)}$ is the model function
- Model loss is minimized via $L(Y,\hat Y)$
- $(x_{1},y_{1}),...,(x_{n},y_{n})$ are the observed data points
- N (**N**umber of observations) = Number of predictions
- TP (**T**rue **P**ositives) = Number of correctly predicted positive outcomes
- FP (**F**alse **P**ositives) = Number of incorrectly predicted positive outcomes
- TN (**T**rue **N**egatives) = Number of correctly predicted negative outcomes
- FN (**F**alse **N**egatives) = Number of incorrectly predicted negative outcomes
- FPR (**F**alse **P**ositive **R**ate) = Percent of actual negative outcomes that were incorrectly predicted as positive
- TPR (**T**rue **P**ositive **R**ate) = Percent of actual positive outcomes that were correctly predicted as positive
- SR (**S**election **R**ate) = Percent of predictions that were positive outcomes
- Protected Classes: Groups that we have a legal and moral obligation to not discriminate against, such as gender, ethnicity, religion, political affiliation, disability, sexuality, and age.
- Proxies: Attributes that can be used to infer a protected class. Frequently identified proxies include: FICO score, Education level, Criminal record, Occupation, Zip code
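As a rough illustration of the notation above, the counts can be tallied directly from paired outcomes and predictions. This is a minimal sketch rather than a production implementation: the function name `confusion_counts` and the sample arrays are hypothetical, and it assumes binary labels where 1 is the positive outcome. The selection rate is computed as positive predictions divided by N, matching the SR formula used for disparate impact.

```python
from typing import Dict, Sequence


def confusion_counts(y_true: Sequence[int], y_pred: Sequence[int]) -> Dict[str, int]:
    """Count N, TP, FP, TN, FN for binary labels (1 = positive, 0 = negative)."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    counts = {"N": len(y_pred), "TP": 0, "FP": 0, "TN": 0, "FN": 0}
    for actual, predicted in zip(y_true, y_pred):
        if predicted == 1:
            # Predicted positive: correct if the actual outcome is positive.
            counts["TP" if actual == 1 else "FP"] += 1
        else:
            # Predicted negative: incorrect if the actual outcome is positive.
            counts["FN" if actual == 1 else "TN"] += 1
    return counts


# Hypothetical data, invented purely for illustration.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 0, 1, 1, 0, 0, 1, 0]

counts = confusion_counts(y_true, y_pred)

# Selection rate: share of all predictions that were positive.
selection_rate = (counts["TP"] + counts["FP"]) / counts["N"]

print(counts)
print(f"SR = {selection_rate:.2f}")
```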

We will begin with Disparate Impact, the foundational bias metric and still the legal standard today. Disparate impact as a concept was solidified by Title VII of the 1964 Civil Rights Act, which states that:

“An employment practice or policy that appears neutral, but has a disproportionately adverse effect on members of the protected class as compared with non-members of the protected class is illegal.”

To determine if disparate impact exists, the **“80 percent”** test was created by a panel of 32 professionals assembled by the State of California Fair Employment Practice Commission (FEPC) in 1971. Under this test, a selection rate for the underprivileged group that is less than 80% of the privileged group’s rate is taken as evidence of disparate impact. This test was then codified in the 1979 Uniform Guidelines on Employee Selection Procedures, a document used by the U.S. Equal Employment Opportunity Commission (EEOC), the Department of Labor, and the Department of Justice in Title VII enforcement.

The equation for Disparate Impact is:

$SR = \frac{\textrm{Positive result count}}{N}$

$\textrm{Disparate Impact Ratio} = \frac{\textrm{Underprivileged Group SR}}{\textrm{Privileged Group SR}}$

For example, if women are selected at a rate of 0.3 and men at a rate of 0.6:

$\textrm{Disparate Impact Ratio} = \frac{\textrm{Female SR}}{\textrm{Male SR}}=\frac{0.3}{0.6}=0.50$

Result: the ratio is below 80%, so disparate impact is present.
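The calculation above is simple enough to sketch in a few lines of Python. The function names, the `FOUR_FIFTHS_THRESHOLD` constant, and the applicant counts (30 of 100 women and 60 of 100 men selected) are assumptions chosen only to reproduce the 0.3 and 0.6 selection rates in the example; this is an illustration, not a compliance tool.

```python
FOUR_FIFTHS_THRESHOLD = 0.8  # the "80 percent" test


def selection_rate(positive_count: int, n: int) -> float:
    """SR = positive result count / N."""
    if n <= 0:
        raise ValueError("n must be positive")
    return positive_count / n


def disparate_impact_ratio(underprivileged_sr: float, privileged_sr: float) -> float:
    """Ratio of the underprivileged group's SR to the privileged group's SR."""
    if privileged_sr <= 0:
        raise ValueError("privileged selection rate must be positive")
    return underprivileged_sr / privileged_sr


# Hypothetical counts that yield the selection rates used in the example.
female_sr = selection_rate(positive_count=30, n=100)
male_sr = selection_rate(positive_count=60, n=100)

ratio = disparate_impact_ratio(female_sr, male_sr)
disparate_impact_flagged = ratio < FOUR_FIFTHS_THRESHOLD

print(f"Disparate impact ratio: {ratio:.2f}, flagged: {disparate_impact_flagged}")
```

Note that nothing in this calculation looks at whether a selection was correct, which is exactly the limitation discussed next.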

The problem with the Disparate Impact Ratio is that it doesn’t take into account the effects of merit or qualifications. For instance, in our example above, 30% of the women who applied were qualified, and thus their selection rate was to be expected. Disparate Impact is still better than nothing and is a key metric to hang your hat on, but it is dated and can flag false positives for bias.

For industries such as insurance, where discrimination based on credit risk is the core of the business, we need a more intelligent metric to better gauge bias. (Whether some of the factors that insurers use are in fact biased and shouldn’t be used is a separate discussion, but credit risk needs to be assessed regardless.)

One of the benefits of the disparate impact ratio is that it does not require ground truth data (the correct outcomes to compare against the predicted outcomes) or any knowledge of how the model was trained. The next metric we will discuss, Equalized Odds, solves our merit issue with Disparate Impact but introduces another problem: knowing what the correct prediction is.

Equalized Odds, as described by Hardt et al. in *Equality of Opportunity in Supervised Learning*, is a relatively new academic criterion for evaluating fairness in machine learning model outcomes. This method seeks to equalize the accuracy of prediction for all demographics [2]. Currently, no accepted constraints exist for what constitutes “unequal odds”. Unlike disparate impact, equalized odds or equal opportunity punishes models that only perform well on the majority outcome class.

We say that a model satisfies Equalized Odds with respect to a protected attribute $A$ and an outcome $Y$ if the prediction and the protected attribute are independent, conditional on the outcome.

$P(\hat Y=1|A=0,Y=y)=P(\hat Y=1|A=1,Y=y), \quad y \in \{0,1\}$

Thus Equalized Odds is achieved if the probability of a certain prediction is not influenced by flipping the protected attribute.

Practically, we can roughly simplify the above equation into the following:

$\textrm{TPR}=\frac{\textrm{TP}}{\textrm{TP + FN}}$

$\textrm{FPR}=\frac{\textrm{FP}}{\textrm{FP + TN}}$

The goal is to minimize the difference between the TPR of the privileged group and the TPR of the underprivileged group. Likewise, we should also minimize the difference in FPRs between privileged and underprivileged groups. An accepted error rate hasn’t been codified yet, meaning we can’t state that an equalized odds measurement is accepted if it has a 5%/10%/20% difference.
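One way to put this into practice is to compute the absolute TPR and FPR gaps between the two groups and track them over time. The sketch below is illustrative only: the `GroupCounts` class, the `equalized_odds_gaps` function, and all of the counts are hypothetical, and it uses the standard definitions TPR = TP / (TP + FN) and FPR = FP / (FP + TN). Because no accepted threshold exists, the code reports the gaps without passing judgment on them.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class GroupCounts:
    """Confusion-matrix counts for one group (hypothetical container)."""
    tp: int
    fn: int
    fp: int
    tn: int

    @property
    def tpr(self) -> float:
        # Share of actual positives that were predicted positive.
        return self.tp / (self.tp + self.fn)

    @property
    def fpr(self) -> float:
        # Share of actual negatives that were predicted positive.
        return self.fp / (self.fp + self.tn)


def equalized_odds_gaps(
    privileged: GroupCounts, underprivileged: GroupCounts
) -> Tuple[float, float]:
    """Return (absolute TPR gap, absolute FPR gap) between two groups."""
    return (
        abs(privileged.tpr - underprivileged.tpr),
        abs(privileged.fpr - underprivileged.fpr),
    )


# Hypothetical counts, invented purely for illustration.
privileged = GroupCounts(tp=40, fn=10, fp=5, tn=45)
underprivileged = GroupCounts(tp=30, fn=20, fp=10, tn=40)

tpr_gap, fpr_gap = equalized_odds_gaps(privileged, underprivileged)
print(f"TPR gap: {tpr_gap:.2f}, FPR gap: {fpr_gap:.2f}")
```

A gap of zero on both rates corresponds to equalized odds being satisfied exactly; how large a gap is tolerable remains a judgment call for each use case.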

In this example, we compare two groups of people, men and women, for equality of opportunity. Here, equalized odds is satisfied because both genders have the same 80% rate on the hired calculation (TPR) and the same 70% rate on the rejected calculation (FPR).

$\textrm{Male hired TPR}=\frac{56}{56+14}=\frac{56}{70}=0.80$

$\textrm{Male rejected FPR}=\frac{21}{21+9}=\frac{21}{30}=0.70$

$\textrm{Female hired TPR}=\frac{24}{24+6}=\frac{24}{30}=0.80$

$\textrm{Female rejected FPR}=\frac{49}{49+21}=\frac{49}{70}=0.70$

In the above example, we take into account qualifications and can accurately determine if the hiring decisions are fair and unbiased. However, in practice, the equalized odds test is difficult to implement, as many use cases are not as cut and dried as “qualified” and “unqualified”. We often don’t know until long after the fact (if ever) whether our model predictions were correct. As a worrying consequence, you could theoretically obtain a perfect equalized odds score and still be biased (as in the above example) because the qualifications themselves are biased.

*Non-parametric cohort analysis* sounds like a mouthful, but *non-parametric* essentially means we are employing statistical tests that do not assume the data fits one fixed probability distribution, such as the normal bell curve. A *cohort analysis* is used to test for a significant relationship between a protected class feature, such as gender (male/female*), and an outcome (hired/rejected*).

There are many different nonparametric statistical tests, a topic for another blog post, but for this post, we will keep with our binary outcome and binary protected class problem, i.e. $\hat Y \in \{0,1\}$ and $A \in \{0,1\}$, respectively, and use the McNemar test, a paired nominal test.

To calculate the McNemar statistic, we first create a contingency table of the data.
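As a hedged sketch of this step, a 2x2 contingency table can be built by counting each combination of the binary protected attribute and the binary prediction. The function name `contingency_cells`, the sample data, and in particular the cell layout (which combination is labeled a, b, c, or d) are assumptions made for illustration; match them to the layout of your own table before using the off-diagonal cells.

```python
from collections import Counter
from typing import Sequence, Tuple


def contingency_cells(
    protected: Sequence[int], predicted: Sequence[int]
) -> Tuple[int, int, int, int]:
    """Return the four cells (a, b, c, d) of a 2x2 table.

    Assumed layout (rows = protected attribute A, columns = prediction):
        a: A=1, prediction=1    b: A=1, prediction=0
        c: A=0, prediction=1    d: A=0, prediction=0
    """
    if len(protected) != len(predicted):
        raise ValueError("inputs must have the same length")
    pairs = Counter(zip(protected, predicted))
    # Counter returns 0 for combinations that never occur.
    return pairs[(1, 1)], pairs[(1, 0)], pairs[(0, 1)], pairs[(0, 0)]


# Hypothetical data, invented purely for illustration.
protected = [1, 1, 1, 0, 0, 1, 0, 0]
predicted = [1, 0, 0, 1, 0, 0, 0, 1]

a, b, c, d = contingency_cells(protected, predicted)
print(f"a={a}, b={b}, c={c}, d={d}")
```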

We then perform a hypothesis test on the table’s off-diagonal cells, $b$ and $c$:

$H_0: p_b = p_c$

$H_a: p_b \neq p_c$

The test statistic is calculated as: $\chi^2 = \frac{(b-c)^2}{b+c}$

Test statistic = $\frac{(96-26)^2}{96+26}\approx40$. This equates to a p-value** of less than 0.0001.

With a p-value this low, we reject $H_0$: bias *may* be present.
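The arithmetic above can be checked with a short, standard-library-only sketch. It assumes the uncorrected form of the statistic shown in this post (no continuity correction) and relies on the fact that the McNemar statistic is compared against a chi-square distribution with one degree of freedom, whose upper-tail probability can be written with the complementary error function. The function names are hypothetical.

```python
import math


def mcnemar_statistic(b: int, c: int) -> float:
    """Uncorrected McNemar statistic: (b - c)^2 / (b + c)."""
    if b + c == 0:
        raise ValueError("b + c must be greater than zero")
    return (b - c) ** 2 / (b + c)


def chi2_upper_tail_1dof(x: float) -> float:
    """P(X > x) for a chi-square variable with 1 degree of freedom.

    With 1 degree of freedom, X is the square of a standard normal,
    so the tail probability reduces to erfc(sqrt(x / 2)).
    """
    return math.erfc(math.sqrt(x / 2.0))


# Off-diagonal counts from the worked example in this post.
statistic = mcnemar_statistic(b=96, c=26)
p_value = chi2_upper_tail_1dof(statistic)

print(f"chi-square statistic: {statistic:.2f}")
print(f"p-value below 0.0001: {p_value < 0.0001}")
```

For small off-diagonal counts, an exact binomial version of the test is generally preferred over this chi-square approximation.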

The McNemar test helps us understand the relationship between a protected (or proxy) feature and an outcome in a statistically valid manner. Where this test comes into its own is as a monitoring test that we run on a schedule, every $x$ days or every $x$ transactions. Complexity gets introduced as we determine sampling needs, but this test serves as a more statistically valid version of disparate impact. However, it runs into the same problem of whether the discrimination is warranted, which equalized odds addresses.

Another complication we often run into is data availability. For fear of bias, many companies do not include any protected features or proxies in their modeling. Worse yet, many do not have their data structured in a way to easily link transactions to this information, essentially leaving us to fly blind.

To combat this, there are a couple of ‘hacks’ that the industry has come up with, such as Bayesian Improved Surname Geocoding (BISG), but they are not nearly as effective as having protected class or proxy information present in the model (we recommend multi-objective modeling; see our previous post *How does bias happen, technically?*).

Monitaur also has the notion of ‘non-model features’: if you can reference and pull in other data that contains a protected class or proxy feature, we can perform bias monitoring on that feature while keeping it fully isolated from the model.

For when there is absolutely nothing to go on, which occurs often in some industries, Monitaur created a last-ditch approach called Optimal Group Differencing (OGD). In OGD, we use a hyperparameter-optimized unsupervised technique to identify clusters of transactions with significantly different correlation patterns that potentially exhibit bias. OGD will return groups of transactions that may exhibit bias, but without the protected attribute or ground truth outcome, it is still difficult to determine if bias exists in the identified transactions. It is a step in the right direction and, for high-risk models, makes it possible to have an individual review a set of transactions that may exhibit bias.

This post has illustrated the recurring theme of this series: “things are complex.” Hopefully this has been helpful to solidify the strengths and weaknesses of several prominent bias metrics. We did not touch on the probability theory behind bias metrics, i.e. independence, separation, and sufficiency, or my personal favorite methods, simulations and counterfactuals, for validating that a model is not biased, but we can save these for subsequent posts.

Dr. Andrew Clark is Monitaur’s co-founder and Chief Technology Officer. A trusted domain expert on the topic of ML auditing and assurance, Andrew built and deployed ML auditing solutions at Capital One. He has contributed to ML auditing education and standards at organizations including ISACA and ICO in the UK. He currently serves as a key contributor to ISO AI Standards and the NIST AI Risk Management framework. Prior to Monitaur, he also served as an economist and modeling advisor for several very prominent crypto-economic projects while at Block Science.

Andrew received a B.S. in Business Administration with a concentration in Accounting, Summa Cum Laude, from the University of Tennessee at Chattanooga, an M.S. in Data Science from Southern Methodist University, and a Ph.D. in Economics from the University of Reading. He also holds the Certified Analytics Professional and American Statistical Association Graduate Statistician certifications. Andrew is a professionally trained concert trumpeter and Team USA triathlete.

Footnotes

- There are additional considerations, such as independent vs dependent sampling, statistical assumptions, etc. that are beyond the scope of this already lengthy blog post, but can be the focus of another blog post if desired.
- *overly simplified for the sake of example
- **P-values are another topic. We can dive into this in the future as well.

[2] Hardt, Moritz, Eric Price, and Nathan Srebro. “Equality of Opportunity in Supervised Learning.” *ArXiv* abs/1610.02413 (2016): n. pag.