Introduction
In machine learning, the goal is to develop models that can identify patterns in data, make predictions, answer questions, or uncover hidden insights. This learning process involves training the model on data, testing it on unseen examples, and evaluating its performance. A critical component of this process is the loss function, which quantifies the difference between the model’s predictions and the actual outcomes. The loss function plays a pivotal role in optimizing the model’s parameters, ensuring that the model can learn from its errors and improve its predictions.
1. A Review of Linear Regression
Before diving into loss functions in linear regression, it’s important to first have a solid understanding of how linear regression works. If you’re not yet familiar with it, I recommend checking out “Linear Regression: A Step-by-Step Guide with Python”, which covers the fundamentals. This article builds on that foundation, focusing on the mathematical aspects of loss functions and providing examples to show their role in optimizing linear regression models.
Now, let’s begin with an example. Here, we have a graph where the horizontal axis represents SAT math scores as the independent variable, and the vertical axis represents college GPA scores as the dependent variable.

Since you already know the basics of linear regression, we can proceed directly to the calculation. The linear regression equation derived from the data in the graph is:

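The specific fitted coefficients depend on the data behind the graph, which isn’t reproduced here. As a rough sketch of how such a best-fit line would be obtained in Python, using made-up SAT/GPA pairs purely for illustration:

```python
import numpy as np

# Made-up SAT-math / GPA pairs, standing in for the data behind the graph
sat = np.array([520.0, 580.0, 620.0, 660.0, 700.0, 740.0])
gpa = np.array([2.6, 2.9, 3.0, 3.2, 3.4, 3.6])

# Ordinary least squares fit of a straight line: gpa ≈ slope * sat + intercept
slope, intercept = np.polyfit(sat, gpa, deg=1)
print(f"GPA ≈ {slope:.4f} * SAT + {intercept:.2f}")

# Predicted GPA for each student according to the fitted line
gpa_pred = slope * sat + intercept
```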
Now that we have the best-fit line, what happens if the predicted values do not match the actual data points?

Does that mean the line isn’t truly the best fit? To answer this, we now turn to the concept of the loss function.
2. What Is a Loss Function?
The best-fit line doesn’t always pass through every data point. To determine the optimal line, we use the loss function, which measures how far the model’s predictions are from the actual data points. The loss function essentially compares the predicted values to the observed values, giving us an idea of how well the model is performing. It evaluates the model’s ability to correctly capture the relationship between X (the independent variable or predictor) and Y (the dependent variable or target).
Referring back to our example, the error for each data point is calculated by subtracting the predicted value from the actual value:
$$e_i = y_i - \hat{y}_i$$

where $y_i$ represents the actual value and $\hat{y}_i$ represents the predicted value.
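As a quick illustration (using made-up actual and predicted GPA values, since the original data points aren’t listed here), the per-point errors can be computed directly:

```python
import numpy as np

# Hypothetical actual and predicted GPA values for five students
y_actual = np.array([2.8, 3.0, 3.2, 3.4, 3.6])
y_pred   = np.array([2.9, 3.0, 3.1, 3.5, 3.4])

# Error (residual) for each data point: actual value minus predicted value
errors = y_actual - y_pred
print(errors)  # roughly [-0.1  0.   0.1 -0.1  0.2]
```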
3. Loss Functions for Regression
The loss function you choose depends on the kind of problem you are working on, i.e., regression or classification. Below are some commonly used loss functions for regression.
3-1. Mean Absolute Error (MAE):
The Mean Absolute Error (MAE), also referred to as L1 loss, is one of the most straightforward loss functions used in machine learning. It is computed by taking the absolute difference between each predicted value and the corresponding actual value, then averaging these differences across the dataset. In other words, MAE is the arithmetic mean of the absolute errors: add up the absolute differences and divide by the total number of data points.
The mathematical representation of MAE is given as:
$$\text{MAE} = \frac{1}{n}\sum_{i=1}^{n}\left|y_i - \hat{y}_i\right|$$

where $n$ denotes the sample size.
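Here is a minimal NumPy sketch of the MAE calculation, reusing the hypothetical values from above:

```python
import numpy as np

y_actual = np.array([2.8, 3.0, 3.2, 3.4, 3.6])
y_pred   = np.array([2.9, 3.0, 3.1, 3.5, 3.4])

# MAE: mean of the absolute differences between actual and predicted values
mae = np.mean(np.abs(y_actual - y_pred))
print(f"MAE: {mae:.3f}")  # MAE: 0.100
```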
3-2. Mean Squared Error (MSE):
As the name suggests, MSE, also known as L2 loss, calculates the difference between the actual value $y_i$ and the predicted value $\hat{y}_i$, but instead of taking the absolute value, it squares the error. This makes MSE much more sensitive to large errors: squaring amplifies big mistakes, making them more noticeable and giving them far more weight in the calculation.
The mathematical representation of MSE is given as:
$$\text{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2$$
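The corresponding MSE calculation, on the same hypothetical values, simply squares the differences instead of taking their absolute value:

```python
import numpy as np

y_actual = np.array([2.8, 3.0, 3.2, 3.4, 3.6])
y_pred   = np.array([2.9, 3.0, 3.1, 3.5, 3.4])

# MSE: mean of the squared differences between actual and predicted values
mse = np.mean((y_actual - y_pred) ** 2)
print(f"MSE: {mse:.4f}")  # MSE: 0.0140
```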
MSE vs. MAE: How to Choose the Right Loss Function
If the outliers represent anomalies that matter for the business and should be detected, MSE is the better choice, since it penalizes them heavily. If, on the other hand, the outliers most likely represent corrupted data, MAE is the better loss, since it largely ignores them.
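To make the difference concrete, the sketch below compares the two losses on the same hypothetical predictions, once with small errors only and once with a single large outlier:

```python
import numpy as np

def mae(y, y_hat):
    return np.mean(np.abs(y - y_hat))

def mse(y, y_hat):
    return np.mean((y - y_hat) ** 2)

y_hat     = np.array([2.9, 3.0, 3.1, 3.5, 3.4])
y_clean   = np.array([2.8, 3.0, 3.2, 3.4, 3.6])  # small errors only
y_outlier = np.array([2.8, 3.0, 3.2, 3.4, 1.0])  # last point is an outlier

print(mae(y_clean, y_hat), mse(y_clean, y_hat))      # both losses are small
print(mae(y_outlier, y_hat), mse(y_outlier, y_hat))  # MSE grows far more than MAE
```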
3-3. Huber Loss:
As mentioned earlier, some loss functions are more sensitive to outliers, while others tend to ignore them. Let’s consider a case where 70% of the data points are in one direction, and the remaining 30% are in another. Technically, this data doesn’t have any outliers, but the Absolute Loss function might treat those 30% as outliers and ignore them. On the other hand, the Squared Loss function will try to fit all points, possibly giving too much weight to the 30% and affecting the model’s performance. A better approach in this case is the Huber Loss, which offers a balanced solution.

Huber loss combines the advantages of quadratic and linear loss functions, providing a balanced way to measure errors. It uses a parameter called 𝛿 (delta) to determine when to switch from quadratic to linear behavior. When the error is smaller than 𝛿, the Huber loss behaves like Mean Squared Error (MSE), squaring the error to calculate the loss. For larger errors, it switches to a linear approach, similar to Mean Absolute Error (MAE).
The mathematical representation of Huber Loss is given as:
$$L_\delta(y, \hat{y}) = \begin{cases} \dfrac{1}{2}\left(y - \hat{y}\right)^2 & \text{for } \left|y - \hat{y}\right| \le \delta \\[4pt] \delta\left(\left|y - \hat{y}\right| - \dfrac{1}{2}\delta\right) & \text{otherwise} \end{cases}$$
Huber loss is useful in these situations because it combines the benefits of MSE and MAE: it is more resistant to outliers than MSE, while its quadratic region around the minimum keeps the loss smooth and the gradient small near zero. However, it requires tuning the delta hyperparameter, which can involve some trial and error.
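A minimal NumPy sketch of Huber loss, with delta as the tunable threshold described above (the data values are again hypothetical):

```python
import numpy as np

def huber_loss(y, y_hat, delta=1.0):
    """Mean Huber loss: quadratic for errors within delta, linear beyond it."""
    error = np.abs(y - y_hat)
    quadratic = 0.5 * error ** 2               # MSE-like branch (small errors)
    linear    = delta * (error - 0.5 * delta)  # MAE-like branch (large errors)
    return np.mean(np.where(error <= delta, quadratic, linear))

y     = np.array([2.8, 3.0, 3.2, 3.4, 1.0])   # last point is an outlier
y_hat = np.array([2.9, 3.0, 3.1, 3.5, 3.4])

# The outlier is penalized linearly rather than quadratically
print(huber_loss(y, y_hat, delta=0.5))
```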
3-4. Log-Cosh Loss:
As mentioned earlier, Huber Loss combines MSE for small errors and MAE for large ones, with a cut-off point δ that requires tuning. In contrast, Log-Cosh Loss uses the logarithm of the hyperbolic cosine, growing slowly for large errors without needing a cut-off. This makes it simpler and more stable to use. In short, Huber Loss handles outliers with a cut-off point, while Log-Cosh Loss offers a smooth, parameter-free alternative.
The mathematical representation of Log-Cosh Loss is given as:
$$L(y, \hat{y}) = \sum_{i=1}^{n} \log\left(\cosh\left(\hat{y}_i - y_i\right)\right)$$
It’s similar to Mean Squared Error (MSE), but with one key difference: it is less sensitive to large outliers. Let’s break it down using the SAT math scores (independent variable) and college GPA scores (dependent variable) from our example.
The log-cosh loss for each prediction is calculated as:
$$\ell_i = \log\left(\cosh\left(\hat{y}_i - y_i\right)\right)$$
For the student with an SAT score of 700, the model predicted a GPA of $\hat{y} = 3.5$ while the actual GPA was $y = 3.4$, so the error is $3.5 - 3.4 = 0.1$. Using the log-cosh formula, we calculate the loss:

$$\log\left(\cosh(0.1)\right) \approx \log(1.0050) \approx 0.0050$$
This is a very small loss, showing that the model’s prediction was quite close to the actual value.
Now suppose the model had predicted a GPA of 4.0 for that same SAT score (an error of 0.6). The log-cosh loss would be higher, but it still grows more slowly than MSE:

$$\log\left(\cosh(0.6)\right) \approx \log(1.1855) \approx 0.1701$$
Although this loss is higher than before, it is still much smaller than the corresponding squared error ($0.6^2 = 0.36$), reflecting how quickly MSE grows with larger errors.
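A quick sketch that reproduces these numbers and shows the log-cosh computation in code:

```python
import numpy as np

def log_cosh_loss(y, y_hat):
    """Sum of log(cosh(error)) over all predictions."""
    return np.sum(np.log(np.cosh(y_hat - y)))

# Single-prediction losses from the worked example above
print(np.log(np.cosh(0.1)))  # ≈ 0.0050
print(np.log(np.cosh(0.6)))  # ≈ 0.1701
print(0.6 ** 2)              # squared error for comparison: 0.36
```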
3-5. Quantile Loss:
Quantile regression helps predict specific quantiles, providing a range of possible outcomes rather than a single point estimate. This is particularly useful when we want to understand the uncertainty in predictions. Unlike linear regression, which assumes constant error variance, quantile regression handles cases where errors are uneven or don’t follow a normal distribution, offering reliable prediction intervals. This makes it a valuable tool for more complex, real-world problems where understanding a range of possibilities is crucial for better decision-making.
The mathematical representation of Quantile Loss is given as:
$$L_\gamma(y, \hat{y}) = \sum_{i:\, y_i < \hat{y}_i} (1 - \gamma)\left|y_i - \hat{y}_i\right| \;+\; \sum_{i:\, y_i \ge \hat{y}_i} \gamma\left|y_i - \hat{y}_i\right|$$

where $\gamma$ is the quantile being estimated and takes values between 0 and 1.
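A minimal sketch of the quantile loss as a NumPy function, with gamma as the chosen quantile (the data values are hypothetical):

```python
import numpy as np

def quantile_loss(y, y_hat, gamma=0.9):
    """Quantile (pinball) loss: errors above and below the prediction are
    weighted asymmetrically according to the chosen quantile gamma."""
    error = y - y_hat
    return np.sum(np.maximum(gamma * error, (gamma - 1) * error))

y     = np.array([2.8, 3.0, 3.2, 3.4, 3.6])
y_hat = np.array([2.9, 3.0, 3.1, 3.5, 3.4])

# gamma = 0.9 penalizes under-prediction (actual above predicted) most heavily,
# so minimizing it pushes the fitted values toward the 90th percentile of y
print(quantile_loss(y, y_hat, gamma=0.9))
print(quantile_loss(y, y_hat, gamma=0.1))
```

Fitting a model by minimizing this loss with, say, $\gamma = 0.9$ instead of $\gamma = 0.5$ pushes its predictions toward the 90th percentile of the observed values, which is how quantile regression produces prediction intervals.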
4. Conclusion
Choosing the right loss function depends on the nature of your data and the problem you are solving. MAE is simple and robust to outliers, while MSE penalizes large errors more heavily. Huber Loss offers a middle ground, and Log-Cosh provides a stable, parameter-free alternative. Quantile Loss is best for predicting ranges, providing more flexibility in uncertain environments.
By understanding the strengths and weaknesses of these loss functions, you can better optimize your regression models for improved performance.
In summary, the following table compares the key characteristics of these loss functions:
Loss Function | Outlier Sensitivity | Sensitivity to Small Errors | Recommended Use Case
--- | --- | --- | ---
MAE | Low | High (constant gradient) | Suitable when data has outliers or an asymmetric error distribution
MSE | High | Low (gradient shrinks near zero) | Best for data with normally distributed errors and few outliers
Huber Loss | Medium | Medium | Balanced option, suitable for both stability and robustness
Log-Cosh | Low | Smooth, MSE-like | Ideal for cases requiring a smoother gradient than Huber Loss
Quantile Loss | Varies by quantile | Depends on quantile | Used in uncertainty prediction and financial risk analysis