In a previous post in our series on bias, *How does bias happen, technically?*, we touched on the notion of a loss function and how algorithms are trained. In this post we dive deeper into what exactly loss functions are and how machine learning models are constructed and 'trained'. We will walk through some popular loss functions, explain how they work, introduce the concept of gradient descent, provide an example of how a loss function gets optimized, and end with a discussion of how *multi-objective optimization* is the future of fair and ethical machine learning modeling.

A loss function, in statistical theory, is a function that calculates the error between actual (true) values and predicted values. Loss functions, also known as cost functions, take different forms to accommodate different goals or use cases, such as a regression model (a model that predicts a continuous value, such as a risk score between 0 and 100). To 'train' a statistical or machine learning model, we solve the optimization problem of *minimizing* the loss. One of the most common methods to train an algorithm and minimize a loss function is *gradient descent*. In gradient descent, we take the partial derivatives of the loss function with respect to the coefficients of the model and repeatedly adjust the coefficients in the direction that reduces the loss, moving towards a minimum. Below is, hopefully, an intuitive example of how gradient descent works:

Imagine an individual is stuck high up in the mountains of Colorado and has gotten lost. Visibility is very low due to the fog, so they do not see a path all the way down the mountain. This hypothetical climber could find their way to the base by employing a real-world variation of gradient descent. To do this, they would look for the path that has the steepest downhill descent in their line of sight. They will take a few steps down this path and then reevaluate whether they need to change direction or continue down. Eventually they will make it to the bottom of the mountain (the global minimum) or the bottom of a hole in the mountain (a local minimum). Each time they measure the steepness of the hill and make an adjustment can be considered an *epoch*, and how far they move after each *epoch* is their *learning rate*.
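To make the analogy concrete, here is a minimal sketch of the climber's procedure on a one-dimensional "mountain". The function names (`descend`, `slope`) and the height function $(w - 3)^2$ are hypothetical choices for illustration only.

```python
def descend(gradient, start, learning_rate, epochs):
    """Follow the negative gradient for a fixed number of epochs."""
    position = start
    path = [position]
    for _ in range(epochs):
        # Step downhill: move against the slope, scaled by the learning rate.
        position = position - learning_rate * gradient(position)
        path.append(position)
    return path


def slope(w):
    """Derivative of the hypothetical height function (w - 3)^2."""
    return 2 * (w - 3)


# Start far up the hill at w = 10 and take 50 small steps.
path = descend(slope, start=10.0, learning_rate=0.1, epochs=50)
print(f"start = {path[0]}, end = {path[-1]}")
```

With a small learning rate, the position approaches the lowest point at $w = 3$. A learning rate that is too large would overshoot the valley instead.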

For our discussion today, we will use a basic statistical model called *Linear Regression*. Linear Regression is a *simple* model that predicts an outcome or *dependent variable* from an observation or *independent variable* with the addition of an intercept. This results in a linear equation of the form *y = mX + c* where *y* is the outcome, *X* is the observation, *m* is the slope of the line, and *c* is the intercept, a constant term. Linear regression models are a good baseline before moving to more complex modeling techniques. Below is the Linear Regression equation written out in matrix notation:

$ \mathbf{y} = \boldsymbol m X + \boldsymbol c\ \ $ where $\ \ \mathbf{y} = \begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{pmatrix},\boldsymbol m = \begin{pmatrix} m_0 \\ m_1 \\ m_2 \\ \vdots \\ m_p \end{pmatrix}, X = \begin{pmatrix} \mathbf{x}^\mathsf{T}_1 \\ \mathbf{x}^\mathsf{T}_2 \\ \vdots \\ \mathbf{x}^\mathsf{T}_n \end{pmatrix}, \boldsymbol c = \begin{pmatrix} c_1 \\ c_2 \\ \vdots \\ c_n \end{pmatrix}$

Note that for our use case we are using simple *univariate* linear regression, that is, only one independent variable. Linear regression also supports more than one independent variable (multiple regression). For brevity, we have omitted a discussion of the assumptions built into linear regression and the exploratory data analysis that is required before building mission-critical models.

To begin our discussion of loss functions, namely the common *MSE* and *MAE* loss functions, we will write out the math and Python code in as much detail as is practical.

From the plot above, we see our randomly distributed data. Since we are using a Gamma distribution, we can see some outliers.

The first loss function we will define is the ubiquitous *Mean Squared Error* (MSE) loss. MSE is defined as the average squared error between the *actual* value of the dependent variable, $Y_i$ (*i* being the specific data point), and the *predicted* value, $\hat Y_i$. As we train a model with gradient descent, the average of the squared errors for the dataset decreases over time. Different loss functions have different use cases and are optimal for different desired outcomes, but MSE is a great starting point for regression (predicting a continuous outcome) problems. Below we have the equation:

$$\textrm{MSE}=\frac{1}{n} \sum_{i=1}^n \left(Y_i-\hat{Y_i}\right)^2$$

Here $n$ is the number of data points and $i$ indexes a specific data point in the array of values.
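As a small illustration, the MSE formula can be written directly in plain Python. The function name `mean_squared_error` and the three data points are hypothetical and chosen so the arithmetic is easy to check by hand.

```python
def mean_squared_error(actual, predicted):
    """Average of the squared differences between actual and predicted values."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    n = len(actual)
    return sum((y - y_hat) ** 2 for y, y_hat in zip(actual, predicted)) / n


actual = [3.0, 5.0, 7.0]
predicted = [2.0, 5.0, 10.0]

# The errors are 1, 0, and -3, so the squared errors are 1, 0, and 9.
mse = mean_squared_error(actual, predicted)
print(f"MSE = {mse}")
```

Note how the single error of 3 contributes 9 of the 10 units of total squared error, which is why MSE is sensitive to large misses.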

Now, to perform gradient descent as described above, we need the partial derivatives of our MSE with respect to $m$, the slope term, and $c$, the constant (intercept) term. To get them, we substitute our simple linear regression equation, $y = mx + c$, into the MSE equation in place of $\hat Y$ ($\hat Y$ can be thought of as $f(x)$):

$$\textrm{MSE}=\frac{1}{n} \sum_{i=1}^n \left(Y_i-(mx_i + c)\right)^2$$

We then take the partial derivatives with respect to $m$ and $c$, the slope and constant components, giving us:

$$\frac{\partial MSE}{\partial m} = \frac{-2}{n}\sum_{i=1}^n x_i \left(Y_i - \hat{Y_i}\right)$$

$$\frac{\partial MSE}{\partial c} = \frac{-2}{n}\sum_{i=1}^n \left(Y_i - \hat{Y_i}\right)$$
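For readers who want the intermediate step, these expressions follow from applying the chain rule to a single squared error term, with $\hat{Y_i} = mx_i + c$:

$$\frac{\partial}{\partial m}\left(Y_i - (mx_i + c)\right)^2 = 2\left(Y_i - (mx_i + c)\right)\cdot(-x_i) = -2x_i\left(Y_i - \hat{Y_i}\right)$$

$$\frac{\partial}{\partial c}\left(Y_i - (mx_i + c)\right)^2 = 2\left(Y_i - (mx_i + c)\right)\cdot(-1) = -2\left(Y_i - \hat{Y_i}\right)$$

Averaging each term over the $n$ data points gives the two partial derivatives above.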

To train the model iteratively, we perform the following steps: use the number of *epochs* to determine how many times we check the steepness of the hill, then use our learning rate, $L$, to determine how much we update $m$ and $c$ at each step. Writing $D_m$ and $D_c$ for the partial derivatives of the loss with respect to $m$ and $c$, the updates are:

- $m = m -L * D_m$
- $c = c -L * D_c$
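The update rules above can be sketched as a short training loop. This is an illustrative version in plain Python, not the code behind the numbers reported in this post: the function name `train_mse`, the Gamma parameters, the true slope of 1.5, and the learning rate of 0.01 are all assumptions chosen so the example runs quickly.

```python
import random


def train_mse(xs, ys, learning_rate, epochs):
    """Fit y = m*x + c by gradient descent on the MSE loss."""
    m, c = 0.0, 0.0
    n = len(xs)
    history = []
    for epoch in range(epochs):
        residuals = [y - (m * x + c) for x, y in zip(xs, ys)]
        # Partial derivatives of the MSE with respect to m and c.
        d_m = (-2 / n) * sum(x * r for x, r in zip(xs, residuals))
        d_c = (-2 / n) * sum(residuals)
        # Update rule: step against the gradient, scaled by the learning rate.
        m = m - learning_rate * d_m
        c = c - learning_rate * d_c
        mse = sum((y - (m * x + c)) ** 2 for x, y in zip(xs, ys)) / n
        history.append((epoch, m, c, mse))
    return m, c, history


# Hypothetical data: Gamma-distributed x values and a noisy line y = 1.5x.
random.seed(0)
xs = [random.gammavariate(2.0, 2.0) for _ in range(200)]
ys = [1.5 * x + random.gauss(0.0, 1.0) for x in xs]

m, c, history = train_mse(xs, ys, learning_rate=0.01, epochs=15)
for epoch, m_i, c_i, mse_i in history:
    print(f"Epoch #{epoch}: m = {m_i}; c = {c_i}; MSE = {mse_i}")
```

The learning rate has to suit the scale of the data. Larger $x$ values produce larger gradients, so they generally call for a smaller learning rate to keep the updates from overshooting.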

Below we will walk through training our model for 15 epochs with gradient descent. We will plot both the data and the corresponding best-fit line from each time step. We will also print out the respective values of $m$ and $c$ and their corresponding MSEs. The Python code is modified from Adarsh Menon's excellent 2018 blog post[1].

Epoch #0: m = 1.3029690942817036; c = 0.020171926803486246; MSE = 11125.510732037144

Epoch #1: m = 1.38492617474331; c = 0.023533981198709558; MSE = 2100.5570855449128

Epoch #2: m = 1.3900542967346852; c = 0.02583827178686301; MSE = 2064.7755880223576

Epoch #3: m = 1.3903481598871632; c = 0.02807595843804381; MSE = 2064.584234839073

Epoch #4: m = 1.39033784070446; c = 0.030309407272782522; MSE = 2064.53379322249

Epoch #5: m = 1.3903083822971603; c = 0.0325425425157025; MSE = 2064.4839117351066

Epoch #6: m = 1.3902777202508474; c = 0.03477561108933626; MSE = 2064.4340346941194

Epoch #7: m = 1.3902469831149034; c = 0.03700862853168542; MSE = 2064.3841598993504

Epoch #8: m = 1.3902162419003055; c = 0.03924159582149052; MSE = 2064.3342873419874

Epoch #9: m = 1.3901855010752178; c = 0.041474513021388985; MSE = 2064.284417021897

Epoch #10: m = 1.3901547609207736; c = 0.04370738013637486; MSE = 2064.2345489389777

Epoch #11: m = 1.3901240214546475; c = 0.04594019716781509; MSE = 2064.1846830931318

Epoch #12: m = 1.390093282677937; c = 0.048172964116848384; MSE = 2064.1348194842553

Epoch #13: m = 1.3900625445906973; c = 0.05040568098459907; MSE = 2064.0849581122493

Epoch #14: m = 1.390031807192917; c = 0.05263834777219053; MSE = 2064.035098977013

As we can see from above, our first MSE was high (about 11,126), then it improved substantially (to about 2,101), improved again, and then settled at roughly 2,064 by the third epoch, with only marginal gains afterwards. Let's see how our training may go under a different loss function.

With mean absolute error (MAE), instead of squaring the errors, we take the absolute value of each error. The main difference between MSE and MAE is their responsiveness to outliers. If you want your model to be more influenced by outliers, use MSE; if you don't want outliers to have too much weight, you may be better served by the MAE loss function, shown below:

$$\textrm{MAE} = \frac{1}{n}\sum_{i=1}^n \left|Y_i - \hat{Y_i}\right|$$
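To see the difference in outlier sensitivity, the sketch below computes both losses on a small hypothetical data set, first without and then with a single large miss. The function names and values are illustrative assumptions.

```python
def mean_squared_error(actual, predicted):
    return sum((y - p) ** 2 for y, p in zip(actual, predicted)) / len(actual)


def mean_absolute_error(actual, predicted):
    return sum(abs(y - p) for y, p in zip(actual, predicted)) / len(actual)


# Four hypothetical points where every prediction is off by exactly 1.
actual = [10.0, 12.0, 11.0, 13.0]
predicted = [11.0, 11.0, 12.0, 12.0]

# The same points plus one outlier where the prediction is off by 38.
actual_outlier = actual + [50.0]
predicted_outlier = predicted + [12.0]

mse_clean = mean_squared_error(actual, predicted)
mae_clean = mean_absolute_error(actual, predicted)
mse_outlier = mean_squared_error(actual_outlier, predicted_outlier)
mae_outlier = mean_absolute_error(actual_outlier, predicted_outlier)

print(f"Without outlier: MSE = {mse_clean}, MAE = {mae_clean}")
print(f"With outlier:    MSE = {mse_outlier}, MAE = {mae_outlier}")
```

In this example both losses equal 1 without the outlier. Adding the outlier raises the MAE to about 8.4 and the MSE to about 289.6, so a model trained on MSE would be pulled much harder towards that one point.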

To optimize our model over the loss function, as in the MSE walkthrough, we need the partial derivatives of our MAE with respect to $m$, the slope term, and $c$, the constant (intercept) term. To get them, we substitute our simple linear regression equation, $y = mx + c$, into the MAE equation in place of $\hat Y$ ($\hat Y$ can be thought of as $f(x)$):

$$\textrm{MAE} = \frac{1}{n}\sum_{i=1}^n |Y_i - (mx_i +c)|$$

Now we run into a wrinkle we didn't have with the MSE: the MAE isn't differentiable everywhere, because the absolute value function has no derivative at zero. We therefore take the partial derivatives with respect to $m$ and $c$ on either side of zero:

$$ \frac{\partial MAE}{\partial m} = \begin{cases} -1 & \text{for } x_i(Y_i - \hat{Y_i}) > 0 \\ +1 & \text{for } x_i(Y_i - \hat{Y_i}) < 0 \end{cases}$$

$$ \frac{\partial MAE}{\partial c} = \begin{cases} -1 & \text{for } Y_i - \hat{Y_i} > 0 \\ +1 & \text{for } Y_i - \hat{Y_i} < 0 \end{cases}$$

Practically, this means that we are not meaningfully optimizing $c$, and that we need to *increase* our learning rate from the 0.0001 we used for the MSE to 0.1. Because the derivative no longer scales with the size of the error, we need to move in larger 'steps' per epoch to be responsive to each change.
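One common way to handle the kink at zero is to use the sign of each residual and treat the derivative at exactly zero as 0. The sketch below takes that approach and averages the per-point terms. This is an assumption on our part and may differ from the code that produced the output in this post. The helper names and the four data points are hypothetical.

```python
def sign(value):
    """Return -1, 0, or +1; 0 is used when the residual is exactly zero."""
    return (value > 0) - (value < 0)


def mean_absolute_error(xs, ys, m, c):
    return sum(abs(y - (m * x + c)) for x, y in zip(xs, ys)) / len(xs)


def mae_step(xs, ys, m, c, learning_rate):
    """One gradient descent update for y = m*x + c under the MAE loss."""
    n = len(xs)
    residuals = [y - (m * x + c) for x, y in zip(xs, ys)]
    # Each point contributes only the sign of its residual, not its size.
    d_m = -sum(x * sign(r) for x, r in zip(xs, residuals)) / n
    d_c = -sum(sign(r) for r in residuals) / n
    return m - learning_rate * d_m, c - learning_rate * d_c


# Hypothetical data lying exactly on the line y = 2x.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]

m, c = 0.0, 0.0
mae_before = mean_absolute_error(xs, ys, m, c)
for _ in range(15):
    m, c = mae_step(xs, ys, m, c, learning_rate=0.1)
mae_after = mean_absolute_error(xs, ys, m, c)

print(f"MAE before = {mae_before}, MAE after = {mae_after}")
```

Because the step size does not shrink as the errors get smaller, the parameters tend to hover around the best fit rather than settle on it, which is one reason the learning rate matters so much for MAE.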

Epoch #0: m = 0.1; c = 0.1; MAE = 100.85963401743122

Epoch #1: m = 0.2; c = 0.2; MAE = 94.31057927904259

Epoch #2: m = 0.3; c = 0.3; MAE = 87.76152454065421

Epoch #3: m = 0.4; c = 0.4; MAE = 81.21246980226589

Epoch #4: m = 0.5; c = 0.5; MAE = 74.66341506387745

Epoch #5: m = 0.6; c = 0.6; MAE = 68.11436032548895

Epoch #6: m = 0.7; c = 0.7; MAE = 61.56530558710057

Epoch #7: m = 0.8; c = 0.8; MAE = 55.01625084871213

Epoch #8: m = 0.9; c = 0.9; MAE = 48.467196110323655

Epoch #9: m = 1.0; c = 1.0; MAE = 41.918141371935214

Epoch #10: m = 1.1; c = 1.1; MAE = 35.36908663354676

Epoch #11: m = 1.2; c = 1.2; MAE = 28.820031895158365

Epoch #12: m = 1.3; c = 1.3; MAE = 22.270977156769924

Epoch #13: m = 1.4; c = 1.4; MAE = 15.721922418381459

Epoch #14: m = 1.3; c = 1.5; MAE = 9.172867679992994

Now that we've established how optimizing a loss function works, we turn to the topic of *multi-objective* loss functions. Up until this point, we had one equation or criterion that we were trying to minimize. However, as we have discussed in previous posts, when it comes to responsible AI, we want to ensure that our models are both performant and unbiased. To accomplish this, we need to optimize for multiple objectives in our loss functions. Multi-objective optimization is a very large, advanced topic with extensive work in the fields of engineering, economics, and logistics. For our purposes today, we will modify our MSE loss function to include a simple constraint that the predicted value cannot be greater than 120. In practice, we would most likely be optimizing a multivariate model for both an accuracy function, such as MSE, and a fairness criterion, such as equalized odds across a protected class variable (see our previous post on Top bias metrics and how they work).

$$\textrm{MSE}_{\max} =\frac{1}{n} \sum_{i=1}^n \left(Y_i-\hat{Y_i}\right)^2, \quad \hat{Y_i} < 120 $$

Now, to perform gradient descent as described above, we need the partial derivatives of our $\textrm{MSE}_{\max}$ with respect to $m$, the slope term, and $c$, the constant (intercept) term, subject to the constraint $\hat{Y_i} < 120$. To get them, we substitute our simple linear regression equation, $y = mx + c$, into the $\textrm{MSE}_{\max}$ equation in place of $\hat Y$ ($\hat Y$ can be thought of as $f(x)$):

$$\textrm{MSE}_{\max}=\frac{1}{n} \sum_{i=1}^n \left(Y_i-(mx_i + c)\right)^2, \quad mx_i + c < 120$$

We then take the partial derivatives with respect to $m$ and $c$, the slope and constant components, replacing any prediction that reaches the limit with 120:

$$\frac{\partial \textrm{MSE}_{\max}}{\partial m} = \frac{-2}{n}\sum_{i=1}^n x_i \left(Y_i - \begin{cases} \hat{Y_i} & \text{for } \hat{Y_i} < 120 \\ 120 & \text{for } \hat{Y_i} \geq 120 \end{cases}\right)$$

$$\frac{\partial \textrm{MSE}_{\max}}{\partial c} = \frac{-2}{n}\sum_{i=1}^n \left(Y_i - \begin{cases} \hat{Y_i} & \text{for } \hat{Y_i} < 120 \\ 120 & \text{for } \hat{Y_i} \geq 120 \end{cases}\right)$$
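A minimal sketch of one update step under the constraint that the predicted value cannot exceed 120 is shown below. It simply substitutes the capped prediction into the MSE derivative expressions and is not a general constrained-optimization method. The names `CAP`, `capped_prediction`, and `capped_mse_step`, along with the three data points, are hypothetical.

```python
CAP = 120.0  # upper limit on predictions, taken from the constraint in the text


def capped_prediction(m, c, x, cap=CAP):
    """Linear prediction, replaced by the cap whenever it would exceed it."""
    return min(m * x + c, cap)


def capped_mse_step(xs, ys, m, c, learning_rate, cap=CAP):
    """One gradient descent update using capped predictions in the residuals."""
    n = len(xs)
    residuals = [y - capped_prediction(m, c, x, cap) for x, y in zip(xs, ys)]
    d_m = (-2 / n) * sum(x * r for x, r in zip(xs, residuals))
    d_c = (-2 / n) * sum(residuals)
    return m - learning_rate * d_m, c - learning_rate * d_c


# Hypothetical data: with m = 2, the third point predicts 200 and is capped at 120.
xs = [10.0, 50.0, 100.0]
ys = [15.0, 75.0, 150.0]

m_new, c_new = capped_mse_step(xs, ys, m=2.0, c=0.0, learning_rate=0.0001)
print(f"m = {m_new}; c = {c_new}")
```

In a fairness setting, the cap would be replaced by a term that measures a bias metric, so that the optimizer balances accuracy against that second objective.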

Epoch #0: m = 1.3029690942817036; c = 0.020171926803486246; MSE = 11125.510732037144

Epoch #1: m = 1.442144301979417; c = 0.024010316347749815; MSE = 1917.442210263628

Epoch #2: m = 1.4955802526488085; c = 0.02647448645966806; MSE = 1709.6243849370342

Epoch #3: m = 1.520488190537731; c = 0.028463009225149227; MSE = 1651.132706823289

Epoch #4: m = 1.5328372774364025; c = 0.03023889240754651; MSE = 1627.0645394102555

Epoch #5: m = 1.539132639411789; c = 0.03191150385295135; MSE = 1615.7126147043207

Epoch #6: m = 1.5424050609390116; c = 0.03353223899537106; MSE = 1609.971450646196

Epoch #7: m = 1.5441092936350949; c = 0.03512601454898867; MSE = 1607.0176496386148

Epoch #8: m = 1.5449929156620412; c = 0.036705665638760654; MSE = 1605.4734716233502

Epoch #9: m = 1.545446687313976; c = 0.03827790311578823; MSE = 1604.651889794666

Epoch #10: m = 1.5456741786500285; c = 0.03984622787016094; MSE = 1604.211963591956

Epoch #11: m = 1.5457819050863417; c = 0.041412474493953355; MSE = 1603.972386307603

Epoch #12: m = 1.5458262428097813; c = 0.04297761399508994; MSE = 1603.8385323390485

Epoch #13: m = 1.5458370304536697; c = 0.04454216030604839; MSE = 1603.7605466993352

Epoch #14: m = 1.5458300609278863; c = 0.046106385439669516; MSE = 1603.712106202352

When we compare the different optimized loss functions and their shape, we see that our multi-objective model ended up having the lowest MSE (about 1,604, versus about 2,064 for the unconstrained MSE model). This will not always be the case, but giving up some performance for a safe, fair, and well-controlled model is normally a worthwhile trade-off.

We have given a high-level introduction to different loss functions and how they work by creating univariate linear regression models from randomly generated data. We walked through how a model is trained using gradient descent and provided the math and code for creating these loss functions and linear models from scratch. We concluded by introducing the complex topic of multi-objective modeling, which will be the focus of future posts. Coming soon in our series will be a discussion on debiasing training data with interpretable techniques as well as model validations.

For the sake of illustration and discussion, we skipped crucial modeling steps such as understanding business needs, data quality assessment, data segmentation, and model cross-validation, to name only a few.

For more information about considering these steps, subscribe to The AI Fundamentalists podcast and search for episodes about model robustness and performance.

[1] Menon, Adarsh. “Linear Regression Using Gradient Descent.” Medium, September 19, 2018. https://towardsdatascience.com/linear-regression-using-gradient-descent-97a6c8700931.