As a follow-on blog post to our recent podcast about non-parametric statistics, we created a visual reference guide to illustrate the discussion we had. In the following examples, we are using Python 3.9 and the excellent scientific computing libraries NumPy, SciPy, Matplotlib, and Seaborn.

As discussed on the podcast, which we recommend you listen to prior to reading this post, probability distributions are, simply, mathematical functions that represent the probability of a value occurring. In the podcast we discussed several standard distributions, and how they are often used. Below are illustrations of the common distributions we mentioned:

Discrete Uniform: the simplest distribution, used when all outcomes are equally likely. For example, if you roll a fair die, the probability of getting any particular number is 1/6. Thus the probability of rolling a 1, 2, 3, 4, 5, or 6 is uniform.

This can be formalized as: $f(x|n) = \frac{1}{n}$ where $n$ is the number of possible outcomes.
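A quick sketch of this in code, using SciPy's `randint` (which implements the discrete uniform over a half-open integer range):

```python
import numpy as np
from scipy.stats import randint

# Discrete uniform over the faces of a fair die: the integers 1..6
die = randint(low=1, high=7)  # high is exclusive in scipy's randint

# Each face has probability 1/n = 1/6
print(die.pmf(3))  # 0.1666...

# Simulate 10,000 rolls; the empirical frequencies approach 1/6
rng = np.random.default_rng(0)
rolls = die.rvs(size=10_000, random_state=rng)
print(np.bincount(rolls, minlength=7)[1:] / len(rolls))
```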

Normal: also called the Gaussian distribution, this is the most common distribution we discuss in statistics. It is used when the data is symmetric and shaped like a "bell". For example, the heights of people in a population are approximately normally distributed, as are their SAT scores.

This distribution is defined by two parameters: the mean and the standard deviation. The mean $(\mu)$ is the center of the distribution, and the standard deviation $(\sigma)$ is a measure of its spread. The standard normal has $\mu = 0$ and $\sigma = 1$.

The probability density function (PDF) of the normal distribution is more formally defined as:

$$f(x|\mu,\sigma) = \frac{1}{\sigma\sqrt{2\pi}}\,e^{-\frac{(x-\mu)^2}{2\sigma^2}}$$
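We can check that evaluating this formula by hand matches SciPy's built-in `norm.pdf`:

```python
import numpy as np
from scipy.stats import norm

mu, sigma = 0.0, 1.0  # the standard normal

# Evaluate the PDF directly from the formula above, and via scipy
x = np.linspace(-4, 4, 9)
pdf_formula = (1 / (sigma * np.sqrt(2 * np.pi))) * np.exp(
    -((x - mu) ** 2) / (2 * sigma ** 2)
)
pdf_scipy = norm.pdf(x, loc=mu, scale=sigma)

print(np.allclose(pdf_formula, pdf_scipy))  # True
print(norm.pdf(0))  # peak of the standard normal, ≈ 0.3989
```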

Below is an example of the 2022 SAT score data, with the red dotted line showing a 'perfect' normal distribution.

Source: nces.ed.gov/programs/digest/d22/tables/dt22_226.40.asp

In the plot below, we overlay the expected normal distribution of SAT scores (red dotted line) and the actual distribution (green dotted line) on top of the underlying data: a very close fit!
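The fitting step behind a plot like this can be sketched as follows. Note that we don't bundle the NCES table here, so the "scores" below are synthetic stand-ins drawn with an assumed mean and spread; the point is only to show how the fitted parameters are obtained:

```python
import numpy as np
from scipy.stats import norm

# Hypothetical stand-in for the SAT data: synthetic scores with an
# assumed mean of 1050 and standard deviation of 210, clipped to the
# valid 400-1600 score range.
rng = np.random.default_rng(42)
scores = rng.normal(loc=1050, scale=210, size=5_000).clip(400, 1600)

# Fit a normal distribution to the observed data (maximum likelihood);
# these fitted parameters define the "expected" curve to overlay.
mu_hat, sigma_hat = norm.fit(scores)
print(f"fitted mean ≈ {mu_hat:.0f}, fitted std ≈ {sigma_hat:.0f}")
```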

Switching gears, below we will attempt to model the power-law distribution using a normal distribution. The power-law distribution is used to model the distribution of wealth, the frequency of words in a language, the sizes of cities, and many other phenomena. It can be formalized as $f(x|\alpha,x_{\min}) = \frac{\alpha-1}{x_{\min}} \left(\frac{x}{x_{\min}}\right)^{-\alpha}$ where $\alpha$ is the scaling exponent and $x_{\min}$ is the minimum value.
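One way to generate such data is via SciPy's `pareto` distribution, whose shape parameter `b` corresponds to $\alpha = b + 1$ with $x_{\min} = 1$ in the notation above. A minimal sketch, fitting a normal to heavy-tailed samples:

```python
import numpy as np
from scipy.stats import pareto, norm, skew

# Power-law samples: scipy's pareto(b) has pdf b / x**(b+1) for x >= 1,
# i.e. alpha = b + 1 and x_min = 1 in the notation above.
rng = np.random.default_rng(0)
alpha = 3.5
samples = pareto(b=alpha - 1).rvs(size=10_000, random_state=rng)

# Fit a normal to the heavy-tailed data; the fit is poor by construction
mu_hat, sigma_hat = norm.fit(samples)
print(f"normal fit: mu ≈ {mu_hat:.2f}, sigma ≈ {sigma_hat:.2f}")

# A symmetric normal cannot capture the strong right skew of a power law
print(f"sample skewness ≈ {skew(samples):.1f}")
```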

In the plot above you can see that, unlike the SAT distributions above, these two lines are nowhere near each other. Different distributions suit different use cases.

In the following section we will briefly touch on concepts of statistical testing. In the example below we calculate and illustrate the probability that an SAT score of 1400 or higher occurred by chance.

Using the probability distribution, we can calculate the probability of observing a given value. For example, if we assume that the data is normally distributed, we can calculate the probability of obtaining an SAT score of 1400 or higher. This is shown by the shaded area in the figure, where green represents the probability of that outcome.

With a little bit of math, we can use the estimated parameters to calculate the probability of observing a given value or a more extreme one. This is called a p-value. The p-value is calculated from the z-score, which is the number of standard deviations $(\sigma)$ the observed value lies from the mean:

$$z = {x- \mu \over \sigma}$$

If the p-value is less than a threshold (usually $\alpha = 0.05$) we say that the result is statistically significant. This means that the probability of observing the data we did is less than 5% assuming the null hypothesis is true. The null hypothesis is the hypothesis that there is no difference between the groups we are comparing.

Probability of getting a score of 1400 or higher is 5.26%

The z-score of the point is 1.62 σ

The p-value of the point is 0.05
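These figure values can be reproduced from the z-score alone. The fitted SAT mean and standard deviation come from the data above; here we simply take the resulting $z = 1.62$ as given and convert it to a one-sided p-value:

```python
from scipy.stats import norm

# z-score: how many standard deviations the observed score sits above
# the mean. We take z = 1.62 from the figure above as given.
z = 1.62

# One-sided p-value: probability of a score this high or higher,
# assuming the data is normally distributed.
p = norm.sf(z)  # survival function, i.e. 1 - CDF
print(f"P(Z >= {z}) = {p:.4f}")  # ≈ 0.0526, i.e. about 5.26%
```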

Let's revisit our example comparing the power-law distribution with the Gaussian distribution. This time we will find the mean $\mu$ and standard deviation $\sigma$ of the power-law data and then generate Gaussian data that matches those parameters. We will then compare the power-law data with the Gaussian data using the non-parametric Kolmogorov–Smirnov test.
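A sketch of this comparison (the exact statistic and p-value will vary with the random seed and sample size, which are assumptions here):

```python
import numpy as np
from scipy.stats import pareto, norm, ks_2samp

rng = np.random.default_rng(0)

# Power-law samples (alpha = 3.5, x_min = 1, i.e. scipy's pareto(b=2.5))
power_law = pareto(b=2.5).rvs(size=10_000, random_state=rng)

# Gaussian data generated with the power-law sample's own mean and std
gaussian = norm(loc=power_law.mean(), scale=power_law.std()).rvs(
    size=10_000, random_state=rng
)

# Two-sample Kolmogorov-Smirnov test: do the samples share a distribution?
stat, p_value = ks_2samp(power_law, gaussian)
print(f"KS Statistic: {stat:.5f}")
print(f"p-value: {p_value:.3g}")
```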

KS Statistic: 0.09362

p-value: 0.0

The Kolmogorov–Smirnov test checks whether a sample comes from a given probability distribution, with no assumptions made about that distribution. Contrast this with the z-score test above, which assumes the data is normally distributed, that the population standard deviation is known, and that the sample size is reasonably large ($n > 30$). The K-S statistic measures the distance between the two empirical distribution functions to determine whether there is a statistically significant difference between them. The test's null hypothesis is that the data follow the same distribution; the alternative hypothesis is that they do not. The test statistic is

$$D = \max _{1 \leq i \leq N}(F(Y_i)-\frac{i-1}{N},\frac{i}{N}- F(Y_i))$$

- $H_0$: the distributions are the same (Null Hypothesis)
- $H_a$: the distributions are different
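The statistic above can be computed directly from its definition and checked against SciPy's one-sample `kstest`. The sample below is synthetic (drawn from a standard normal, an assumption for illustration), with the hypothesized CDF $F$ taken to be the standard normal CDF:

```python
import numpy as np
from scipy.stats import norm, kstest

# Sorted sample Y_1 <= ... <= Y_N drawn from a standard normal
rng = np.random.default_rng(1)
y = np.sort(rng.normal(size=500))
N = len(y)

# D from the formula: max over i of (F(Y_i) - (i-1)/N, i/N - F(Y_i))
F = norm.cdf(y)
i = np.arange(1, N + 1)
D_manual = np.max(np.maximum(F - (i - 1) / N, i / N - F))

# scipy's one-sample K-S test computes the same statistic
D_scipy, p = kstest(y, norm.cdf)
print(np.isclose(D_manual, D_scipy))  # True
```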

In the above case we find that the p-value is $\approx 0.0$, which means we can reject the null hypothesis that the two samples come from the same distribution: the power-law data is not normally distributed. Since the K-S test is essentially assumption-free, it can be used in instances where a z-score would normally be used. The main drawback of non-parametric test statistics such as the K-S test is that when your parametric assumptions are met, as in the z-score example above, parametric tests will produce higher statistical power than a non-parametric test (power essentially meaning finding statistical significance if there is any to be found).

In this post we've provided a visual companion to our non-parametric podcast discussion. If this content was helpful, if you would like us to go deeper in an area, or if you have questions or requests, please submit feedback on our site.

Until next time,

Andrew & Sid, The AI Fundamentalists.