Loss Functions used in Artificial Intelligence

Aryan Lala
9 min read · Apr 13, 2021

After training, your Machine Learning model will produce some results, but how do you know whether those results are any good? This is where loss functions come into the picture in Artificial Intelligence. A loss function evaluates your model much like a metric, but instead of showing how good your model is, it focuses on how bad it is. It represents the cost paid for the inaccuracy of the model's predictions, so the lower the loss, the better your model. In other words, it measures the error between your model's predictions and the given target values, and an optimizer is then used to reduce that error.

There are two types of loss functions: Regression loss functions and Classification loss functions. We will discuss some loss functions of each type.

Regression Loss Functions:

  1. Mean Squared Error: Mean Squared Error (MSE) is equal to the average squared difference (distance or error) between the predicted and target (actual) values. The lower the MSE, the better the model.
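Written out, with y_i denoting the target value, ŷ_i the prediction for the i-th sample and N the number of samples (symbols introduced here for clarity), the standard form is:

$$\text{MSE} = \frac{1}{N}\sum_{i=1}^{N}\left(y_i - \hat{y}_i\right)^2$$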

MSE is a very popular metric used in regression problems. Squaring the error is beneficial because it always gives a positive value, so positive and negative errors cannot cancel each other out in the sum. Squaring also highlights larger differences, which can be both good and bad: it pushes the trained model to avoid large errors, but a single outlier can magnify the loss disproportionately. It is also called the L2 loss.

2. Mean Absolute Error: Mean Absolute Error (MAE) is equal to the average absolute difference (distance) between the predicted and target (actual) values.
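In the same notation as above, the standard form is:

$$\text{MAE} = \frac{1}{N}\sum_{i=1}^{N}\left|y_i - \hat{y}_i\right|$$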

MSE and MAE are the most common loss functions for regression problems. Since MAE does not square the error, it is more robust to outliers than MSE. It is also known as the L1 loss.

3. Huber Loss: It is a combination of the absolute and squared errors. When the error is small the loss is quadratic, and when the error is large the loss is linear. The threshold that decides whether an error counts as small or large is a tunable parameter, usually denoted δ. Compared to the squared error loss, Huber loss is less sensitive to outliers.
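A standard way to write it, with a = y − ŷ denoting the error and δ the threshold parameter, is:

$$L_{\delta}(a) = \begin{cases} \frac{1}{2}a^{2} & \text{if } |a| \le \delta \\ \delta\left(|a| - \frac{1}{2}\delta\right) & \text{otherwise} \end{cases}$$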

The value of δ determines which predictions are treated as outliers. Increasing δ means fewer predictions are treated as outliers, since more of them fall in the squared-error region rather than the absolute-error region.

4. Log cosh Loss: It is equal to the logarithm of the hyperbolic cosine (cosh) of the error between the predicted and actual output values, summed over all samples. This loss function is smoother than the MSE (L2 loss).
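In the notation used above, the loss over N samples is commonly written as:

$$L = \sum_{i=1}^{N}\log\left(\cosh\left(\hat{y}_i - y_i\right)\right)$$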

For small values of x, log(cosh(x)) is approximately equal to x²/2, and for large values of x it is approximately equal to |x| − log(2). Thus, log cosh loss behaves much like MSE but is not as strongly affected by the occasional wildly incorrect prediction.

Classification Loss Functions:

  1. Cross Entropy Loss: It is also known as log loss. Log loss is a very common loss function used in classification problems. It increases as the predicted probability diverges from the actual label. Its formula is as follows:
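In the standard binary form, with p_i and y_i the predicted probability and true label of the i-th sample:

$$L = -\frac{1}{N}\sum_{i=1}^{N}\left[\,y_i\log(p_i) + (1 - y_i)\log(1 - p_i)\,\right]$$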

Here p is the probability of the positive class (y = 1) and N is the number of samples in the dataset.

When y = 0 the first term becomes zero, and when y = 1 the second term becomes zero. Hence, for each sample, the loss is essentially the negative log of the probability predicted for the true class.

2. Hinge Loss: This loss function is mainly used for Support Vector Machine (SVM) classifiers. Hinge loss penalizes wrong predictions as well as correct predictions that are not confident. The class labels used in SVM are -1 and +1, so the labels of the dataset should be re-scaled accordingly. It is also known as SVM Loss.

The formula for hinge loss is as follows:
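For a single prediction, in its standard form:

$$L(y) = \max\left(0,\ 1 - t \cdot y\right)$$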

Here t is the actual output (target), and y is the output predicted by our classifier.

Multi Class Classification Loss Functions:

Not all problems have only two classes; some problems have multiple output classes. For example, in a fruit image dataset you might have to classify more than two fruits, such as apples, oranges and mangoes, or think of the handwritten digits classification problem. For such multi-class classification you need loss functions designed for multiple classes rather than for plain binary classification.

  1. Multi Class Cross Entropy Loss: It is the same as the binary cross entropy loss, only the number of classes increases. It is also known as categorical cross entropy loss. The formula is given below, where X is the input vector and Y is the one-hot encoded target vector.
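One standard way of writing it, with C classes, Y_c the target for class c, and ŷ_c(X) (notation added here) the model's predicted probability of class c given input X:

$$L(X, Y) = -\sum_{c=1}^{C} Y_c \log\left(\hat{y}_c(X)\right)$$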

2. KL-Divergence: The Kullback-Leibler (KL) Divergence score measures how much one probability distribution differs from another, reference, probability distribution. It is also known as relative entropy. It is computed as the sum, over all events, of the probability of the event under P multiplied by the logarithm of the ratio of that probability to the probability of the same event under Q. Here P and Q are the two probability distributions and '||' denotes divergence.
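In symbols, for discrete distributions over events x:

$$D_{KL}(P \,\|\, Q) = \sum_{x} P(x)\log\left(\frac{P(x)}{Q(x)}\right)$$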

The term inside the sum is the contribution of a single event to the divergence.

KL divergence is not symmetric, that is, D_KL(P || Q) ≠ D_KL(Q || P) in general.

Loss Functions used in Image Segmentation:

  1. Pixel-wise Cross Entropy Loss: We have already discussed cross entropy loss; the difference here is that this loss considers each pixel of an image separately, comparing the model's prediction for that pixel against its target label. The formula is the same as log loss, the error calculation is repeated for every pixel, and finally the average over all pixels is taken. This is the most common loss function used for image segmentation problems. Similarly, Mean Squared Error (MSE) and Mean Absolute Error (MAE) can be applied to each predicted/target pixel pair individually, with the final loss taken as the average over all the pixels.
  2. Sørensen Dice Coefficient: This loss is based on a coefficient that measures the similarity or overlap between two samples. Its values range from 0 to 1, where 1 represents perfect similarity or complete overlap (as a loss, one typically minimizes 1 minus the coefficient). Its formula is as follows:
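With A and B the two samples being compared:

$$\text{Dice}(A, B) = \frac{2\,|A \cap B|}{|A| + |B|}$$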

Where |A ∩ B| denotes the number of elements common to sets A and B, and |A| denotes the number of elements in set A (and similarly for set B).

In the case of images, the intersection term is taken as the sum of the element-wise multiplication of the pixel matrices of the two samples, and the number of elements of each set is taken as the sum of the elements of its pixel matrix.
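As a rough illustration of this, here is a minimal NumPy sketch of a soft Dice loss for a predicted mask and a target mask (the function name, epsilon term and example values are my own):

```python
import numpy as np

def dice_loss(pred, target, eps=1e-7):
    """Soft Dice loss for two masks of the same shape.

    pred:   predicted mask, values in [0, 1]
    target: ground-truth binary mask
    """
    intersection = np.sum(pred * target)   # element-wise product, then sum
    union = np.sum(pred) + np.sum(target)  # sizes of the two "sets"
    dice = (2.0 * intersection + eps) / (union + eps)
    return 1.0 - dice                      # loss = 1 - Dice coefficient

# Example: a perfect prediction gives a loss close to 0
mask = np.array([[0, 1], [1, 1]], dtype=float)
print(dice_loss(mask, mask))  # ~0.0
```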

In image segmentation, this coefficient is used to compare the output of our model against reference masks, for example in medical imaging applications.

It is also known as the Sørensen index, Dice's coefficient, and the F1 score.

Loss Functions used in Computer Vision:

  1. Perceptual Loss Function: This loss function is employed when two similar-looking images have to be compared, for example the same image shifted by one pixel or rendered at a different resolution. Here the pixel-wise loss functions would report a large error even though the images are perceptually almost identical. This is where the perceptual loss function comes into the picture to save your model: it compares semantic and high-level perceptual differences between the two images. The perceptual loss is equal to the squared and normalized Euclidean distance between the feature representations of the two images.
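In the notation of the paper cited below, with φ_j(x) the activations of the j-th layer of the loss network for image x, of shape C_j × H_j × W_j, the feature reconstruction (perceptual) loss is:

$$\ell^{\phi,j}_{feat}(\hat{y}, y) = \frac{1}{C_j H_j W_j}\left\|\phi_j(\hat{y}) - \phi_j(y)\right\|_2^2$$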
Source: Perceptual Losses for Real-Time Style Transfer and Super-Resolution

2. Content-Style Loss Function: First, let us discuss the style transfer technique. Style transfer is the method of rendering the semantic content of an image in a different style. In simple words, you take a content image whose style you want to change, for example an image of some houses beside a lake, and a reference image whose style you want to apply, for example Van Gogh's 'The Starry Night'. The output image would then show those houses in the style of The Starry Night. You can see an example with three different styles in the image below.

Example image of Style Transfer
Source: A Neural Algorithm of Artistic Style

We know that the higher layers of a CNN capture the content information of an image, while the lower layers focus on individual pixel values. So, the activation maps for the original content image (C) and for the predicted output (P) are computed at a chosen layer, and the content loss is calculated as follows:
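With F^l_{ij}(·) denoting the activation of the i-th filter at position j in layer l (notation adapted from the cited paper), the content loss at layer l can be written as:

$$L_{content}(C, P, l) = \frac{1}{2}\sum_{i,j}\left(F^{l}_{ij}(C) - F^{l}_{ij}(P)\right)^2$$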

Loss Functions used in Natural Language Processing (NLP):

Natural Language Processing is the field in which computers process and analyze natural language data; it concerns the interaction between computers and human language. Machine translation, which automatically translates text from one language to another, is a very significant application of NLP.

  1. BLEU Score: No, BLEU does not mean some random gibberish, like Joey trying to speak French…
For those who didn’t understand the reference: Joey Tribbiani trying to speak French in the show FRIENDS

Bilingual Evaluation Understudy (BLEU) is an algorithm used to evaluate text that a machine has translated from one language to another, for example from French to English.

The score is computed by comparing each translated text segment, usually a sentence, with a set of reference translations. The scores are then averaged over the complete collection of texts to evaluate the overall quality of the translation. The score ranges from 0 to 1, with 1 denoting complete similarity between the candidate translation (predicted output) and the reference translation (target label). It works on a modified form of precision.

Candidate: the the the the the the the

Reference 1: the cat is on the mat

Reference 2: there is a cat on the mat

Using normal precision, every word in the candidate translation appears in the reference translations, so the precision would be 7/7 = 1. That is a perfect score, yet the candidate clearly does not convey the content of the references and simply repeats one word seven times. So instead, we clip each candidate word's count to the maximum number of times it appears in any single reference; here that maximum is 2 for 'the', while the total word count of the candidate is 7. Thus, the modified unigram precision score is 2/7.

The BLEU score applies this modified precision to n-grams, not just to single words.
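As a rough sketch of the idea (the function name and details are mine, and this covers only the clipped-precision part, not the full BLEU score with its brevity penalty and n-gram averaging):

```python
from collections import Counter

def modified_ngram_precision(candidate, references, n=1):
    """Clipped n-gram precision, the core of BLEU."""
    def ngrams(tokens, n):
        return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

    cand_counts = Counter(ngrams(candidate, n))

    # For each n-gram, the maximum count observed in any single reference
    max_ref_counts = Counter()
    for ref in references:
        for gram, count in Counter(ngrams(ref, n)).items():
            max_ref_counts[gram] = max(max_ref_counts[gram], count)

    # Clip each candidate count by the maximum reference count
    clipped = sum(min(count, max_ref_counts[gram]) for gram, count in cand_counts.items())
    total = sum(cand_counts.values())
    return clipped / total if total else 0.0

candidate = "the the the the the the the".split()
references = ["the cat is on the mat".split(), "there is a cat on the mat".split()]
print(modified_ngram_precision(candidate, references, n=1))  # 2/7 ≈ 0.2857
```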

2. ROUGE: Recall-Oriented Understudy for Gisting Evaluation (ROUGE) is used to evaluate automatic summaries or machine translations against reference texts. It can be seen as a modified, recall-oriented counterpart of BLEU: instead of precision, it measures how many of the n-grams in the reference appear in the output.

So, this is the end, the end of the article, not the end of your model. After choosing a loss function you still have to choose an optimizer, which will adjust the weights of your model in order to decrease the loss. The choice of loss function depends entirely on your application. There are a lot of loss functions out there, so choose wisely.

Thank you for reading my article! I hope it increased your knowledge. If you are interested in Artificial Intelligence you can read my other articles. Goodbye!
