The Mystery of Vanishing and Exploding Gradients

Artificial Neural Networks, as we know them, were first proposed in 1943 to mimic the biological nervous system and help machines learn the way humans do.
But it was not until 1975, with the arrival of the famous Back-Propagation Algorithm, that we could actually make machines learn and recognize patterns in data, bringing new hope for training multi-layered networks.
It allowed researchers to train supervised deep artificial neural networks from scratch, although with little success at first. The reason for this poor training accuracy when using Back-Propagation was later identified by Sepp Hochreiter in 1991.

The Problem

  • The Vanishing Gradients
  • The Exploding Gradients

The Vanishing Gradient Problem

It is a common phenomenon with gradient-based optimisation techniques, and it affects not only many-layered feed-forward networks but also recurrent networks.

In Deep Neural Networks, adding more and more hidden layers lets our network learn more complex, arbitrary functions and features, and therefore achieve higher accuracy when predicting outcomes or identifying a pattern/feature in complex data such as images and speech.

But adding layers comes at a cost, which we refer to as the Vanishing Gradient.
The error that is back-propagated by the Back-Propagation Algorithm may become so small by the time it reaches the input layers of the model that it has very little effect. This phenomenon is called the Vanishing Gradient Problem.
It makes it difficult to know in which direction the parameters/weights should move to improve the cost function, and therefore causes premature convergence to a poor solution.

Below is an example of how back-propagation works mathematically for a network with 4 hidden layers.
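The worked example itself is not reproduced here; as a minimal sketch, assume a chain of four hidden layers with a single sigmoid neuron each, weights w2, w3, w4, pre-activations z_i, activations a_i = σ(z_i), and cost J. The chain rule then gives

$$
\frac{\partial J}{\partial b_1}
= \sigma'(z_1)\, w_2\, \sigma'(z_2)\, w_3\, \sigma'(z_3)\, w_4\, \sigma'(z_4)\, \frac{\partial J}{\partial a_4}.
$$

Since σ′(z) ≤ 1/4 everywhere, each extra layer multiplies the gradient by a factor of at most |w_i|/4, so with typical small weights the product shrinks rapidly on its way back to b1.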

Sometimes it may happen that ∂J/∂b1 becomes practically zero, and hence contributes almost nothing to the updating of the weights, causing a premature end to the learning of the model.
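To see the effect numerically, here is a minimal sketch (assuming PyTorch; the depth, width, and dummy loss are illustrative, not the original example) that prints the gradient norm of each layer of a deep sigmoid network after one backward pass:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# 8 hidden layers of 64 sigmoid units each, plus a linear output layer.
layers = []
for _ in range(8):
    layers += [nn.Linear(64, 64), nn.Sigmoid()]
model = nn.Sequential(*layers, nn.Linear(64, 1))

x = torch.randn(32, 64)          # a dummy batch of 32 examples
loss = model(x).pow(2).mean()    # a dummy quadratic cost J
loss.backward()

# The earliest Linear layers receive gradients that are orders of magnitude
# smaller than the later ones -- the vanishing-gradient effect in action.
for i, m in enumerate(model):
    if isinstance(m, nn.Linear):
        print(f"layer {i:2d}: grad norm = {m.weight.grad.norm().item():.2e}")
```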

The Exploding Gradients Problem

Let’s now talk about another scenario that is very common with Deep Neural Nets and that also leads to failure of model training.

Sometimes it may happen that, while updating the weights, error gradients accumulate and result in very large gradients. These in turn produce large weight updates and make the network unstable, the worst-case scenario being that the values of the weights become NaN.
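As a minimal sketch of this effect (assuming PyTorch; the depth, width, and deliberately over-scaled initialization are illustrative), the snippet below propagates an error signal backwards by hand and shows its norm growing layer by layer:

```python
import torch

torch.manual_seed(0)
depth, width = 8, 64

# Deliberately over-scaled weights: each backward step multiplies the error
# signal by a matrix whose effective gain is well above 1.
Ws = [3.0 * torch.randn(width, width) for _ in range(depth)]

# Forward pass with ReLU activations, keeping activations for the backward pass.
a = [torch.randn(width)]
for W in Ws:
    a.append(torch.relu(W @ a[-1]))

# Backward pass: push a unit-norm error signal from the output towards the
# input and watch its norm blow up layer by layer.
delta = torch.ones(width) / width ** 0.5
for i in reversed(range(depth)):
    delta = Ws[i].t() @ (delta * (a[i + 1] > 0).float())
    print(f"layer {i}: |delta| = {delta.norm().item():.3e}")
```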

 

Solution to the Vanishing Gradient Problem

  1. The simplest solution is to use the ReLU activation function, which does not produce a vanishingly small derivative for positive inputs.
  2. Use a Residual Network, as residual connections provide a path straight back to earlier layers (a sketch combining these ideas follows this list).

     The residual connection directly adds the value present at the beginning of the block, x, to the end of the block (F(x) + x). The skip path therefore does not have to go through the activation functions that “squash” the derivatives, resulting in a higher overall derivative of the block.

  3. Adding a Batch Normalization layer also mitigates the vanishing gradient problem, as it normalizes the input to each layer.
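Here is a minimal sketch of a residual block that combines the three ideas above: ReLU activations, a skip connection that adds x to F(x), and batch normalization. It assumes PyTorch, and the layer width and batch size are illustrative.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """F(x) + x: the skip connection gives gradients a path that bypasses
    the squashing non-linearities inside the block."""
    def __init__(self, width: int):
        super().__init__()
        self.body = nn.Sequential(
            nn.Linear(width, width),
            nn.BatchNorm1d(width),   # normalizes the input to the activation
            nn.ReLU(),               # derivative is 1 for positive inputs
            nn.Linear(width, width),
            nn.BatchNorm1d(width),
        )
        self.relu = nn.ReLU()

    def forward(self, x):
        return self.relu(self.body(x) + x)   # F(x) + x

block = ResidualBlock(64)
out = block(torch.randn(32, 64))   # a dummy batch of 32 samples
print(out.shape)                   # torch.Size([32, 64])
```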

Identifying the Exploding Gradient Problem

A few simple checks may help you identify whether training is suffering from the exploding gradient problem:

1. The model is unable to gain traction on your training data (e.g. persistently poor loss).
2. The model is unstable, i.e. large changes in loss from update to update.
3. The model loss and weights go to NaN during training.
4. The error gradient values are consistently above 1.0 for each node and layer during training (one way to check this is sketched below).
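To make check 4 concrete, here is a minimal sketch (assuming PyTorch; total_grad_norm is a hypothetical helper and the toy model is illustrative) that measures the global gradient norm after a backward pass:

```python
import torch
import torch.nn as nn

def total_grad_norm(model: nn.Module) -> float:
    """Global L2 norm of all parameter gradients; call after loss.backward()."""
    total = 0.0
    for p in model.parameters():
        if p.grad is not None:
            total += p.grad.norm().item() ** 2
    return total ** 0.5

# Toy usage: one backward pass, then inspect the gradient norm.
model = nn.Sequential(nn.Linear(64, 64), nn.ReLU(), nn.Linear(64, 1))
loss = model(torch.randn(32, 64)).pow(2).mean()
loss.backward()
print(f"total gradient norm: {total_grad_norm(model):.3f}")
# If this value stays well above 1.0 update after update, the training run
# is likely suffering from exploding gradients (check 4 above).
```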

 

Solving the Problem of Exploding Gradients

  1. When gradients explode, they can become NaN because of numerical overflow, or we may see irregular oscillations in the training cost when we plot the learning curve. A solution is to apply Gradient Clipping, which places a predefined threshold on the gradients to prevent them from getting too large; it does not change the direction of the gradients, it only changes their length (see the sketch after this list).
  2. Network Re-designing
    Using a smaller batch size while training may also show some improvement in tackling exploding gradients.
  3. LSTM Networks
    Using LSTMs, and perhaps related gated-type neuron structures, is the current best practice for avoiding exploding gradients in recurrent networks.
  4. Weight Regularization
    If exploding gradients are still occurring, check the size of the network weights and apply a penalty to the network’s loss function for large weight values.
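As a minimal sketch (assuming PyTorch; the clipping threshold of 1.0, the weight-decay factor, and the toy model and data are illustrative, not from the original post), gradient clipping and weight regularization can be applied inside an ordinary training loop like this:

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(64, 64), nn.ReLU(), nn.Linear(64, 1))
# weight_decay adds the L2 weight-regularization penalty from item 4.
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, weight_decay=1e-4)

x, y = torch.randn(32, 64), torch.randn(32, 1)   # a dummy batch
for _ in range(10):                               # a toy training loop
    optimizer.zero_grad()
    loss = nn.functional.mse_loss(model(x), y)
    loss.backward()
    # Gradient clipping: rescale all gradients so their global norm is at most
    # 1.0, capping their length without changing their direction.
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    optimizer.step()
```

Clipping the global norm (rather than each element separately) is what preserves the gradient direction while bounding its length.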

 
