Uncertainty in Deep Learning — Brief Introduction

Kaan Bıçakcı
Towards Data Science
9 min read · Feb 1, 2022


When humans don’t know the answer to a question, they say “I don’t know”. But does the same apply to Deep Learning models?

Photo by Matt Walsh on Unsplash

In this post, I will try to give an intuition for uncertainty in Deep Learning models rather than explaining these uncertainties in depth. They will be explained in the next parts using TensorFlow Probability.

This will be a series of articles in which I explain those terms and show TFP implementations.

First, What is Uncertainty?

Uncertainty can be defined as the lack of knowledge or certainty about something. It is an unavoidable part of life and omnipresent in both natural and artificial systems.

In the context of Deep Learning, there are two main types of uncertainty:

  • 1) Aleatoric Uncertainty: This is uncertainty due to the randomness in the data.
  • 2) Epistemic Uncertainty: This is uncertainty due to the lack of knowledge about the true parameters of the model.
Example of uncertainty in the predictions. Image by author

When we train our deep learning models, we employ Maximum Likelihood Estimation (MLE).

In a nutshell, MLE is a method of estimating the parameters of a statistical model from a set of data. It is a technique that finds the parameter values that produce the best possible match between the model and the data.

In other words, we search for weights that explain the data well. Or: given the data I have, what are the optimal weights? This is a classic optimization problem, and the training process is stochastic. But the question is: are there multiple sets of weights that explain the data well?
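To make this concrete, minimizing the familiar mean squared error can be read as MLE under a Gaussian noise assumption. Below is a minimal sketch of that connection; it is my own illustration, not code from the original notebook, and the function name is hypothetical.

import numpy as np

def gaussian_negative_log_likelihood(y_true, y_pred, sigma=1.0):
    # Per-sample -log p(y | x, w) under Gaussian noise, dropping the constant 0.5 * log(2 * pi).
    # With sigma fixed, minimizing this over the weights that produce y_pred
    # is the same as minimizing the mean squared error.
    return np.mean(0.5 * ((y_true - y_pred) / sigma) ** 2 + np.log(sigma))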

Basic Linear Regression

First, we create linear data with some random noise.
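The data-generation code is not embedded in this excerpt; a minimal sketch that produces a similar dataset (the slope of 2, intercept of 1 and noise scale are my own choices) could look like this:

import numpy as np

np.random.seed(42)  # for reproducibility

# y = 2x + 1 plus Gaussian noise; shape (200, 1) so it feeds directly into Keras Dense layers.
x_train = np.linspace(0.0, 10.0, 200, dtype=np.float32).reshape(-1, 1)
noise = np.random.normal(scale=1.0, size=x_train.shape).astype(np.float32)
y_train = 2.0 * x_train + 1.0 + noise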

Image by author.

Given the data we have, there should be several model configurations that fit it. First, let’s define a utility function. This function will take three separate models and train them with the same loss and optimizer. Then it will print the prediction for 15 and plot the fitted lines.
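The original gist with this utility function is not reproduced here; a sketch of what it might look like, assuming the x_train and y_train arrays from above and three small Keras models that differ only in their kernel initializers:

import numpy as np
import matplotlib.pyplot as plt
from tensorflow import keras

def build_model(initializer):
    # Tiny regression network; the ReLU layer is kept only to mirror the models described below.
    return keras.Sequential([
        keras.layers.Dense(16, activation='relu',
                           kernel_initializer=initializer, input_shape=(1,)),
        keras.layers.Dense(1),
    ])

def train_and_compare(models, x, y, epochs=200):
    # Train every model with the same loss and optimizer,
    # print the prediction for 15 and plot the fitted lines.
    for i, model in enumerate(models, start=1):
        model.compile(optimizer='adam', loss='mse')
        model.fit(x, y, epochs=epochs, verbose=0)
        print(f'Model{i}: Prediction for 15:',
              model.predict(np.array([[15.0]]), verbose=0))
        plt.plot(x, model.predict(x, verbose=0), label=f'Model{i}')
    plt.scatter(x, y, s=5, alpha=0.3)
    plt.legend()
    plt.show()

# Three different initializers give three different starting points for gradient descent.
models = [build_model(init)
          for init in ('glorot_uniform', 'he_normal', 'random_normal')]
train_and_compare(models, x_train, y_train)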

We actually did not need the ReLU activations, since the dataset is linear. I was experimenting with some models and their final form had ReLU, so I did not remove them in the end.

As you can see, every model has a different starting point because the initializers are different. Thus, the learnt weights will be different, but not by much.

When we run the script above, we get:

Model1: Prediction for 15: [[31.22317]]
Model2: Prediction for 15: [[30.969236]]
Model3: Prediction for 15: [[31.227913]]
Images by author

All of the lines seem reasonable, even though the learnt weights are different. That is uncertainty: there is more than one model that explains the given data, depending on the starting point. In other words, the weights are not certain!

The estimated parameters mainly depend on two things:

  • Dataset Size
  • Starting Point of Gradient Descent

As you may guess, when there is more data the estimated parameters will be more accurate.

If you think about bigger models and datasets, every re-run (fitting process) will most of the time produce similar results, but the weights will not be exactly the same.

Basic Image Classification and Softmax

This time, I will show a different example where the task is multiclass classification. For this, I will use fashion_mnist as a basic example.

After loading and processing the data, a simple model can be written as follows:
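The model definition itself is not embedded in this excerpt, but the summary below pins down the architecture. Here is a sketch that reproduces those layer shapes and parameter counts; the kernel size of 3, 'same' padding, ReLU activations and sparse categorical cross-entropy loss are inferred from the summary and the training log, not confirmed by the original.

from tensorflow import keras

model = keras.Sequential([
    keras.layers.Conv2D(16, 3, padding='same', activation='relu',
                        input_shape=(28, 28, 1)),
    keras.layers.MaxPooling2D(),
    keras.layers.Conv2D(32, 3, padding='same', activation='relu'),
    keras.layers.MaxPooling2D(),
    keras.layers.Conv2D(64, 3, padding='same', activation='relu'),
    keras.layers.MaxPooling2D(),
    keras.layers.Conv2D(128, 3, padding='same', activation='relu'),
    keras.layers.GlobalMaxPooling2D(),
    keras.layers.Dense(128, activation='relu'),
    keras.layers.Dense(10, activation='softmax'),
])

model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['acc'])
model.summary()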

The model summary looks like this, and there is nothing special about it:

Layer (type)                                Output Shape           Param #
=================================================================
conv2d (Conv2D)                             (None, 28, 28, 16)     160
max_pooling2d (MaxPooling2D)                (None, 14, 14, 16)     0
conv2d_1 (Conv2D)                           (None, 14, 14, 32)     4640
max_pooling2d_1 (MaxPooling2D)              (None, 7, 7, 32)       0
conv2d_2 (Conv2D)                           (None, 7, 7, 64)       18496
max_pooling2d_2 (MaxPooling2D)              (None, 3, 3, 64)       0
conv2d_3 (Conv2D)                           (None, 3, 3, 128)      73856
global_max_pooling2d (GlobalMaxPooling2D)   (None, 128)            0
dense_6 (Dense)                             (None, 128)            16512
dense_7 (Dense)                             (None, 10)             1290
=================================================================
Total params: 114,954
Trainable params: 114,954
Non-trainable params: 0
# Last Epoch
Epoch 16/16
469/469 [==============================] - 6s 12ms/step - loss: 0.1206 - acc: 0.9551 - val_loss: 0.3503 - val_acc: 0.9014

Let’s ignore the overfitting here and plot the softmax outputs for some of the predictions. Before that, we take a few samples and labels from the dataset and predict them.

We take the first batch and predict.
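The prediction snippet is not shown in this excerpt; a sketch, assuming the test data was batched with tf.data into batches of 128 (the name test_dataset is mine, while samples reappears in the code later in the post):

import matplotlib.pyplot as plt

# Grab the first batch of (images, labels) and run the trained model on it.
samples, labels = next(iter(test_dataset))   # assumed tf.data pipeline, batch size 128
predicted = model.predict(samples)           # (128, 10) softmax outputs

# One bar chart of softmax values per class for the first sample.
plt.bar(range(10), predicted[0])
plt.title(f'True label: {int(labels[0])}')
plt.show()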

Image by author

For the first plot, the correct label was 1, and the model output a very high softmax value for that sample. But even though the softmax value is high, it does not tell us how sure the model actually is.

To test this, a random noise vector will be fed to the model, and then we will discuss the softmax outputs.

(I will explain the ensemble parameter used for the second model in later parts.)
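The helper that produced the numbers below is also not included in this excerpt; a minimal sketch of what it might do (the name plot_softmax_for_noise and its ensemble parameter are my own, chosen so that the later call with model2 makes sense):

import numpy as np
import matplotlib.pyplot as plt
import tensorflow as tf

# A random 28x28 "image" that looks nothing like Fashion-MNIST.
random_vector = tf.random.uniform(shape=(28, 28, 1))

def plot_softmax_for_noise(model, noise, ensemble=1):
    # Run `ensemble` forward passes; dropout stays active (training=True) only when ensembling.
    keep_dropout = ensemble > 1
    preds = np.stack([
        model(tf.expand_dims(noise, axis=0), training=keep_dropout).numpy().squeeze()
        for _ in range(ensemble)
    ])
    mean_pred = preds.mean(axis=0) * 100  # averaged softmax outputs, in percent
    print({i: float(p) for i, p in enumerate(mean_pred)})
    plt.bar(range(10), mean_pred)
    plt.title(f'Ensemble size: {ensemble}')
    plt.show()

plot_softmax_for_noise(model, random_vector)  # single deterministic prediction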

{0: 6.912527, 1: 0.0038282804, 2: 0.9346371, 3: 0.7660487, 
4: 4.7964582, 5: 3.58412e-05, 6: 80.114265, 7: 0.0002095818,
8: 6.4642153, 9: 0.007769133}
Image by author.

According to the softmax output, there is an 80% probability that this noise vector belongs to class 6.

A common misconception about softmax is that the higher the value, the more confident the model is [1]. This is not always the case. The softmax value can be very high while the prediction is wrong, for example with:

  • Adversarial Attacks
  • Out of Distribution Data (Example: Random Noise Vector)

Also, as in the regression example we discussed, there are multiple sets of model weights that explain the data, depending on the starting point of gradient descent. Here, we ignored them entirely because the model has point-estimate weights and we have only one model.

So the question arises: what if we took other possible models into account and ensembled them to get a prediction? Say, what if we ensembled 100 different models on this data?

Ensembling 100 Models

Training 100 different deep learning models is not practical in real-life projects. But we can think of a hacky way to get different predictions from a single point-estimate model.

As you know, dropout is a regularization technique used in deep learning to prevent overfitting. It works by randomly dropping units (neurons) out of the network during training, with the hope that this will prevent any one neuron from becoming too influential. I will explain how we can use this to get different models and ensemble them.

Before that, let’s create the same model and add dropouts.
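The dropout variant is not shown either; here is a sketch of the same architecture with Dropout layers inserted. The 0.3 rate and the exact placement of the Dropout layers are my assumptions.

from tensorflow import keras

model2 = keras.Sequential([
    keras.layers.Conv2D(16, 3, padding='same', activation='relu',
                        input_shape=(28, 28, 1)),
    keras.layers.MaxPooling2D(),
    keras.layers.Dropout(0.3),
    keras.layers.Conv2D(32, 3, padding='same', activation='relu'),
    keras.layers.MaxPooling2D(),
    keras.layers.Dropout(0.3),
    keras.layers.Conv2D(64, 3, padding='same', activation='relu'),
    keras.layers.MaxPooling2D(),
    keras.layers.Conv2D(128, 3, padding='same', activation='relu'),
    keras.layers.GlobalMaxPooling2D(),
    keras.layers.Dense(128, activation='relu'),
    keras.layers.Dropout(0.3),
    keras.layers.Dense(10, activation='softmax'),
])

model2.compile(optimizer='adam',
               loss='sparse_categorical_crossentropy',
               metrics=['acc'])

# Trained with the same data and settings as the first model (train_dataset is assumed).
model2.fit(train_dataset, validation_data=test_dataset, epochs=16)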

Dropout layers are not active when we get predictions from the model. What if we keep the effect of the dropout layers while predicting an input?

If we use the model’s __call__ method, we can keep them active:

for _ in range(4):
    predict_noise = model2(tf.expand_dims(random_vector, axis=0),
                           training=True).numpy().squeeze()
    print('Softmax output for class 0:', predict_noise[0] * 100)

With training=True we keep the Dropout layers active as if the network were training, but we only run a forward pass. The code above prints results like these:

Softmax output for class 0: 3.7370428442955017
Softmax output for class 0: 4.5094069093465805
Softmax output for class 0: 0.9782549925148487
Softmax output for class 0: 1.7607659101486206

Because of the dropout layers, some neurons are dropped randomly on each forward pass. So, as a naive approach, we can use this tactic to get different predictions, as if we had 100 different models.

The first model was completely deterministic: if you predict the same sample, you get the same output every time, and since it had no dropout layers there was no practical way to get different predictions. But now we can get different predictions from a single model!

Recall that we had a utility function to predict softmax output for the given noise vector:
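Reusing the sketched helper from above, the call would look something like this:

# 100 stochastic forward passes with dropout kept active, averaged into one prediction.
plot_softmax_for_noise(model2, random_vector, ensemble=100)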

This time we pass model2 as the model parameter and 100 as the ensemble size. So we get predictions from the model 100 times while keeping the dropout layers active, and for the final prediction we simply take the mean of those predictions.

{0: 4.0112467, 1: 0.4843149, 2: 11.618713, 3: 8.531735, 
4: 8.837839, 5: 0.04070336, 6: 59.83536, 7: 0.96540254,
8: 4.8854494, 9: 0.7892267}
Image by author

Well, the results are better than the first model’s, but still not the best. That is because, even though we used a kind of ensemble technique, the model’s weights are still point estimates.

Let’s predict the first batch with the new model and inspect the results.

predictions = []
for _ in range(1000):
    predicted_ensemble = model2(samples,
                                training=True).numpy().squeeze()
    predictions.append(predicted_ensemble)

predictions = np.array(predictions)
print('Predictions shape:', predictions.shape)  # (1000, 128, 10)
predictions_median = np.median(predictions, axis=0)

Instead of the mean, we take the median as the final prediction. When we plot the predictions, they look like this:

Image by author

If we compare these predictions with the previous ones, we see that some of them have changed slightly. That is because we kept dropout active while predicting, which effectively gave us a random ensemble of models.

Now we can get multiple different predictions for a single input. That means we can generate confidence intervals from a set of predictions.

Dropout to Obtain Confidence Intervals

Using a for loop, we can get predictions for the samples:
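The plotting code itself is not reproduced here; a sketch of how the 95% interval shown in the figures below could be computed from the stacked dropout predictions (the percentile-based interval is my assumption about how the original plots were made):

import numpy as np
import matplotlib.pyplot as plt

# `predictions` has shape (1000, 128, 10): 1000 stochastic passes over the first batch.
sample_idx = 0
sample_preds = predictions[:, sample_idx, :]          # (1000, 10)

lower = np.percentile(sample_preds, 2.5, axis=0)      # lower bound of the 95% interval
upper = np.percentile(sample_preds, 97.5, axis=0)     # upper bound
median = np.median(sample_preds, axis=0)

# Bar height = median softmax value; green error bar = width of the 95% interval per class.
plt.bar(range(10), median, yerr=[median - lower, upper - median],
        ecolor='green', capsize=4)
plt.xlabel('Class')
plt.ylabel('Softmax output')
plt.show()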

Here are some sample plots to comment on:

Image by author.

This plot shows that, even though we ensembled 1000 predictions for this specific input, the 95% interval essentially collapses onto the median and the normal prediction. So we may say that this sample really belongs to class 1 and the uncertainty is low.

Let’s inspect another example:

Image by author.

What just happened? The normal model had a softmax output of almost 1.00. When we inspect the 95% interval, we see that the green bars are taller than before. That is because, with the dropout model, we were able to capture some uncertainty in the predictions.

The conclusion is that the taller these green bars are, the less certain the prediction.

Image by author.

Last, let’s inspect this prediction. There was only a small probability for class 0 when we used the single point-estimate model. When we applied the ensembling technique, we saw there was huge uncertainty in the predictions. Notice how tall the green bars are!

Conclusion

We discussed that:

  • There are multiple sets of weights that explain the data, depending on the starting point of gradient descent, even when the dataset is the same.
  • Vanilla dropout ensembling is not the best approach, because it is somewhat random.

Next Steps

In the following posts, we will use a more systematic way to represent both aleatoric and epistemic uncertainty with TensorFlow Probability, while explaining those terms in detail.

You can check out the Jupyter notebook used in this post here.

References

[1]: Tim Pearce, Alexandra Brintrup, Jun Zhu, Understanding Softmax Confidence and Uncertainty, 2021.
