The loss function

Rafael Irizarry

In this section, we describe the general approach to defining “best” in machine learning: we define a loss function, which can be applied to both categorical and continuous outcomes.

The most commonly used loss function is the squared loss function. If \hat{y} is our prediction and y is the observed outcome, the squared loss function is simply:

(\hat{y} - y)^2

Because we often have a test set with many observations, say N, we use the mean squared error (MSE), which is the residual sum of squares (RSS) divided by N:

\mbox{MSE} = \frac{1}{N} \mbox{RSS} = \frac{1}{N}\sum_{i=1}^N (\hat{y}_i - y_i)^2

In practice, we often report the root mean squared error (RMSE), which is \sqrt{\mbox{MSE}}, because it is in the same units as the outcomes. But doing the math is often easier with the MSE, so it is more commonly used in textbooks, since these usually describe theoretical properties of algorithms.
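To make these definitions concrete, here is a minimal sketch in Python, with made-up outcomes and predictions used purely for illustration:

```python
import numpy as np

# Made-up observed outcomes and predictions for a small test set
y = np.array([3.1, 2.5, 4.0, 3.8, 2.9])
y_hat = np.array([3.0, 2.7, 3.5, 4.0, 3.1])

mse = np.mean((y_hat - y) ** 2)  # mean squared error
rmse = np.sqrt(mse)              # root mean squared error, same units as y

print(mse, rmse)
```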

If the outcomes are binary, (\hat{y} - y)^2 is 0 if the prediction was correct and 1 otherwise, so the MSE is equivalent to one minus accuracy, and the RMSE is simply its square root. In general, our goal is to build an algorithm that minimizes the loss so it is as close to 0 as possible.
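We can verify this equivalence with a quick sketch, again using made-up binary data:

```python
import numpy as np

# Made-up binary outcomes and binary predictions
y = np.array([1, 0, 1, 1, 0, 1])
y_hat = np.array([1, 0, 0, 1, 1, 1])

accuracy = np.mean(y_hat == y)
mse = np.mean((y_hat - y) ** 2)  # each term is 0 if correct, 1 if wrong

print(1 - accuracy)  # 0.333...
print(mse)           # also 0.333..., one minus accuracy
```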

Because our data is usually a random sample, we can think of the MSE as a random variable and the observed MSE can be thought of as an estimate of the expected MSE, which in mathematical notation we write like this:

\mbox{E}\left\{ \frac{1}{N}\sum_{i=1}^N (\hat{Y}_i - Y_i)^2 \right\}

This is a theoretical concept because in practice we only have one dataset to work with. But in theory, we think of having a very large number of random samples (call it B), apply our algorithm to each, obtain an MSE for each random sample, and think of the expected MSE as:

\frac{1}{B}\sum_{b=1}^B \frac{1}{N}\sum_{i=1}^N \left(\hat{y}_i^b - y_i^b\right)^2

with y_i^b denoting the i^{\text{th}} observation in the b^{\text{th}} random sample and \hat{y}_i^b the resulting prediction obtained from applying the exact same algorithm to the b^{\text{th}} random sample. Again, in practice we only observe one random sample, so the expected MSE is only theoretical.
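Although the expected MSE cannot be computed from real data, we can approximate it with a Monte Carlo simulation in which we control the data-generating process. The sketch below assumes a made-up model, y = 2x + noise, and uses least squares as the algorithm; both choices are ours and serve only to illustrate the idea:

```python
import numpy as np

rng = np.random.default_rng(2024)

B, N = 1000, 100  # number of random samples and observations per sample
mses = np.empty(B)

for b in range(B):
    # draw the b-th random sample: a training set and a test set
    x_train = rng.uniform(0, 1, N)
    y_train = 2 * x_train + rng.normal(0, 1, N)
    x_test = rng.uniform(0, 1, N)
    y_test = 2 * x_test + rng.normal(0, 1, N)

    # apply the exact same algorithm to each sample: a least squares fit
    slope, intercept = np.polyfit(x_train, y_train, deg=1)
    y_hat = slope * x_test + intercept

    mses[b] = np.mean((y_hat - y_test) ** 2)

# the average across the B samples approximates the expected MSE
print(np.mean(mses))
```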

Note that there are loss functions other than the squared loss. For example, the mean absolute error (MAE) uses absolute values, |\hat{Y}_i - Y_i|, instead of the squared errors (\hat{Y}_i - Y_i)^2. However, in this book we focus on minimizing squared loss since it is the most widely used.
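Computed on the same made-up test set used earlier, the MAE is:

```python
import numpy as np

# Same made-up outcomes and predictions as in the MSE example
y = np.array([3.1, 2.5, 4.0, 3.8, 2.9])
y_hat = np.array([3.0, 2.7, 3.5, 4.0, 3.1])

mae = np.mean(np.abs(y_hat - y))  # mean absolute error
print(mae)
```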
