Batch, Mini-Batch, and Stochastic Gradient Descent
A guide on gradient descent and its types.
What is an Optimizer?
Before we look into gradient descent, let’s see what an optimizer is.
An optimizer is an algorithm that updates a neural network's attributes, such as its weights, in order to reduce the loss.
- Role of an optimizer: the optimizer is responsible for reducing the loss and producing the most accurate results possible by adjusting the network's attributes.
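As a minimal sketch of what an optimizer does, consider repeatedly nudging a single parameter against the gradient of the loss (the function name, learning rate, and toy loss below are illustrative assumptions, not from this article):

```python
def gradient_step(w, grad, lr=0.1):
    """One optimizer update: move the parameter against the loss gradient."""
    return w - lr * grad

# Minimizing loss(w) = (w - 3)^2, whose gradient is 2 * (w - 3)
w = 0.0
for _ in range(100):
    w = gradient_step(w, 2.0 * (w - 3.0))
# w is now very close to 3, the minimizer of the loss
```

All the gradient descent variants below use this same update rule; they differ only in how much data is used to compute `grad` at each step.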
Different Types of Gradient Descent
- Batch gradient descent
- Mini-batch gradient descent
- Stochastic gradient descent
These variants differ in how much of the training data is used to compute each parameter update.
Batch Gradient Descent
Batch gradient descent, also known as vanilla gradient descent, is the variant that uses the entire training set to take a single step.
In batch gradient descent, the whole dataset is passed through the model in each epoch, and the model is updated once after all samples in the epoch have been processed.
1. Take the entire dataset.
2. Feed it to the model.
3. Calculate the gradient.
4. Use the calculated gradient to update the model.
5. Repeat steps 1–4 for several epochs.
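The steps above can be sketched as a toy linear-regression loop. The data, learning rate, and epoch count here are illustrative assumptions; note that exactly one update happens per epoch:

```python
import numpy as np

# Toy data for y = 2x + 1 (illustrative, not from the article)
rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=100)
y = 2.0 * X + 1.0

def batch_gradient_descent(X, y, lr=0.1, epochs=500):
    """One parameter update per epoch, using the gradient over the full dataset."""
    w, b = 0.0, 0.0
    n = len(y)
    for _ in range(epochs):
        error = w * X + b - y                  # predictions minus targets, all samples
        grad_w = (2.0 / n) * np.dot(error, X)  # d(MSE)/dw over the whole dataset
        grad_b = (2.0 / n) * error.sum()       # d(MSE)/db over the whole dataset
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

w, b = batch_gradient_descent(X, y)  # w ≈ 2, b ≈ 1
```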
Pros of BGD
- BGD is computationally efficient per update compared with the other variants, since one pass over the data produces a single, well-averaged gradient.
- The low update frequency gives a more stable error gradient, which results in stable convergence on some problems.
Cons of BGD
- BGD is not recommended for very large training sets, since training is slow: the whole dataset must be processed to perform just one update.
- BGD uses more memory on larger datasets, as the entire dataset has to be held in memory for training.
Mini-Batch Gradient Descent (MGD)
Mini-batch gradient descent is a variant of gradient descent that splits the training set into multiple mini-batches.
All the mini-batches are processed in each epoch, but the error is calculated at the end of each batch. This updates the parameters periodically instead of waiting for the entire dataset to be processed.
The cost function may therefore appear a little noisy compared with batch gradient descent.
1. Take the entire dataset.
2. Split the dataset into mini-batches.
3. Pick a mini-batch and feed it to the neural network.
4. Calculate the gradient.
5. Use the calculated gradient to update the network.
6. Repeat steps 3–5 for all the mini-batches.
7. Repeat steps 3–6 for the number of epochs.
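The steps above can be sketched on the same toy linear-regression setup. The data, batch size, learning rate, and epoch count are illustrative assumptions; the key difference is one update per mini-batch rather than per epoch:

```python
import numpy as np

# Toy data for y = 3x - 0.5 (illustrative, not from the article)
rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=120)
y = 3.0 * X - 0.5

def minibatch_gradient_descent(X, y, lr=0.05, epochs=100, batch_size=16):
    """One parameter update per mini-batch; batches are reshuffled every epoch."""
    w, b = 0.0, 0.0
    n = len(y)
    for _ in range(epochs):
        order = rng.permutation(n)                # shuffle sample order each epoch
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]
            error = w * X[idx] + b - y[idx]
            m = len(idx)                          # the last batch may be smaller
            w -= lr * (2.0 / m) * np.dot(error, X[idx])  # d(MSE)/dw on this batch
            b -= lr * (2.0 / m) * error.sum()            # d(MSE)/db on this batch
    return w, b

w, b = minibatch_gradient_descent(X, y)
```

Note the extra `batch_size` hyperparameter, which is exactly the con mentioned below.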
Pros of MGD
- The model update frequency is higher than with batch gradient descent, which allows more robust convergence and can help avoid local minima.
- Training on batches of samples is computationally more efficient than stochastic gradient descent.
- Splitting the data into batches also improves memory efficiency, since the entire training set does not need to be held in memory at once.
Cons of MGD
- Mini-batch gradient descent introduces an additional hyperparameter, the mini-batch size, which must be tuned when training the network.
Stochastic Gradient Descent
In stochastic gradient descent (SGD), we consider one sample at a time, meaning SGD updates the neural network's parameters after every sample.
The cost function fluctuates much more than with the other gradient descent methods, because each individual sample triggers an update.
We perform the following steps for SGD:
1. Take one sample from the dataset.
2. Feed it to the neural network.
3. Calculate the gradient.
4. Use the gradient calculated in step 3 to update the weights.
5. Repeat steps 1–4 for all the samples in the training dataset.
6. Repeat steps 1–5 for the given number of epochs.
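The steps above can be sketched as follows; the data, learning rate, and epoch count are illustrative assumptions. Compared with the earlier variants, the update happens inside the per-sample loop:

```python
import numpy as np

# Toy data for y = 1.5x + 0.5 (illustrative, not from the article)
rng = np.random.default_rng(42)
X = rng.uniform(-1.0, 1.0, size=100)
y = 1.5 * X + 0.5

def stochastic_gradient_descent(X, y, lr=0.05, epochs=50):
    """One parameter update per sample, so updates are frequent but noisy."""
    w, b = 0.0, 0.0
    for _ in range(epochs):
        for i in rng.permutation(len(y)):   # visit samples in random order
            error = w * X[i] + b - y[i]
            w -= lr * 2.0 * error * X[i]    # gradient of the squared error for one sample
            b -= lr * 2.0 * error
    return w, b

w, b = stochastic_gradient_descent(X, y)
```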
Pros of SGD
- The frequent updates give immediate insight into the model's performance.
- In some cases, the frequent updates lead to faster learning.
- SGD can also help escape local minima.
Cons of SGD
- Frequent model updates may be computationally expensive.
- Training time may be longer for a large dataset.
- The constant updates produce a noisy gradient, which can cause higher variance across training epochs.
Each variant of gradient descent has its pros and cons, and there is no single best method. Based on the data and the training strategy, we choose the approach that best reduces the model's loss.