Neural networks are trained using gradient descent where the estimate of the error is used to update the weights and is calculated based on a subset of the training dataset. The number of examples from the training dataset used in the estimate of the error gradient is called the batch size and is an important hyperparameter that influences the dynamics of the learning algorithm.
In the case of mini-batch gradient descent, popular batch sizes include $32$, $64$, and $128$ samples; you will often see these values used for models in the deep learning literature.
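For intuition, the batch size also determines how many weight updates are performed per pass through the training data. A minimal sketch (assuming a hypothetical dataset of $1000$ samples, matching the one used later in this post):
import math
# rough sketch: number of weight updates per epoch for different batch sizes
n_samples = 1000
for batch_size in [1, 32, 64, 128, 1000]:
    updates_per_epoch = math.ceil(n_samples / batch_size)
    print(f'batch_size={batch_size}: {updates_per_epoch} weight update(s) per epoch')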
Advantages:
Downsides:
Advantages:
Disadvantages:
Advantages:
Disadvantages:
We will use a small multi-class classification problem as the basis to demonstrate the effect of batch size on learning.
The scikit-learn library provides the make_blobs()
function that can be used to create a multi-class classification problem with the prescribed number of samples, input variables, classes, and variance of samples within a class.
The problem can be configured to have two input variables (to represent the $x$ and $y$ coordinates of the points) and a standard deviation of $2.0$ for points within each group. We will use the same random state (seed for the pseudorandom number generator) to ensure that we always get the same data points.
# scatter plot of blobs dataset
from sklearn.datasets import make_blobs
import matplotlib.pyplot as plt
import numpy as np
# generate 2d classification dataset
X, y = make_blobs(n_samples=1000, centers=3, n_features=2, cluster_std=2, random_state=2)
# scatter plot for each class value
for class_value in range(3):
    # select indices of points with the class label
    row_ix = np.where(y == class_value)
    # scatter plot for points with a different color
    plt.scatter(X[row_ix, 0], X[row_ix, 1], label=class_value, alpha=0.5)
plt.legend()
# show plot
plt.show()
We can see that the standard deviation of $2.0$ means that the classes are not linearly separable (separable by a line), causing many ambiguous points.
This is desirable as it means that the problem is non-trivial and will allow a neural network model to find many different “good enough” candidate solutions.
from keras.layers import Dense
from keras.models import Sequential
from keras.optimizers import SGD
from keras.utils import to_categorical
# prepare train and test dataset
def prepare_data():
    # generate 2d classification dataset
    X, y = make_blobs(n_samples=1000, centers=3, n_features=2, cluster_std=2, random_state=2)
    # one hot encode output variable
    y = to_categorical(y)
    # split into train and test
    n_train = 500
    train_X, test_X = X[:n_train, :], X[n_train:, :]
    train_y, test_y = y[:n_train], y[n_train:]
    return train_X, train_y, test_X, test_y
# fit a model and plot learning curve
def fit_model(train_X, train_y, test_X, test_y, n_batch, epochs, opt):
    # define model
    model = Sequential()
    model.add(Dense(50, input_dim=2, activation='relu', kernel_initializer='he_uniform'))
    model.add(Dense(3, activation='softmax'))
    # compile model
    model.compile(loss='categorical_crossentropy', optimizer=opt, metrics=['accuracy'])
    # fit model
    history = model.fit(train_X, train_y, validation_data=(test_X, test_y), epochs=epochs, verbose=0,
                        batch_size=n_batch)
    # evaluate the model
    scores = model.evaluate(test_X, test_y)
    accuracy = round(scores[1] * 100, 3)
    # plot learning curves
    plt.plot(history.history['accuracy'], label='train')
    plt.plot(history.history['val_accuracy'], label='test')
    plt.title(f'Accuracy = {accuracy}%, Batch Size = {n_batch}' + \
              f', \n# of Epochs = {epochs}', pad=-40)
    plt.legend()
# prepare dataset
train_X, train_y, test_X, test_y = prepare_data()
# create learning curves for different batch sizes
batch_sizes = [5, 10, 16, 32, 64, 128, 256, len(train_X)]  # last entry uses the full training set (batch gradient descent)
# set optimizer
opt = SGD(lr=0.01, momentum=0.9)
plt.figure(figsize=(12,12))
for i in range(len(batch_sizes)):
    # determine the plot number
    plot_no = 420 + (i+1)
    plt.subplot(plot_no)
    # fit model and plot learning curves
    fit_model(train_X, train_y, test_X, test_y, batch_sizes[i], epochs=200, opt=opt)
# show learning curves
plt.tight_layout()
plt.show()
Running the example creates a figure with eight line plots showing the classification accuracy on the train and test sets of models with different batch sizes when using mini-batch gradient descent.
The plots show that small batch sizes generally result in rapid learning but a volatile learning process, with higher variance in the classification accuracy. Larger batch sizes slow down the learning process, but the final stages converge to a more stable model, exemplified by lower variance in classification accuracy.
The number of epochs is a hyperparameter that defines the number times that the learning algorithm will work through the entire training dataset.
One epoch means that each sample in the training dataset has had an opportunity to update the internal model parameters. An epoch is composed of one or more batches. For example, as above, training in which each epoch consists of a single batch is called batch gradient descent.
You can think of it as a for-loop over the number of epochs, where each iteration proceeds over the training dataset. Within this for-loop is a nested for-loop that iterates over each batch of samples, where one batch contains the specified “batch size” number of samples used to estimate the error and update the weights.
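As a minimal, self-contained sketch of these two nested loops (using a toy linear model and random data purely for illustration; Keras runs this loop internally inside model.fit(), and the names X_demo, y_demo, w are mine):
import numpy as np
# sketch of the nested epoch/batch loops on a toy linear model
rng = np.random.default_rng(1)
X_demo, y_demo = rng.normal(size=(100, 2)), rng.normal(size=(100, 1))
w, lr, n_epochs, batch_size = np.zeros((2, 1)), 0.01, 5, 32
for epoch in range(n_epochs):                        # outer loop: one pass over the dataset per epoch
    for start in range(0, len(X_demo), batch_size):  # inner loop: one weight update per batch
        xb = X_demo[start:start + batch_size]
        yb = y_demo[start:start + batch_size]
        grad = 2 * xb.T @ (xb @ w - yb) / len(xb)    # error gradient estimated on this batch
        w -= lr * grad                               # weight update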
The number of epochs is traditionally large, often hundreds or thousands, allowing the learning algorithm to run until the error from the model has been sufficiently minimized. You may see examples of the number of epochs in the literature and in tutorials set to $10$, $100$, $500$, $1000$, or even larger.
We will use the same dataset we used to look at batch size to examine the effect of the number of epochs on the model’s learning.
# create learning curves for different epoch sizes
n_epochs = [10, 20, 32, 64, 128, 200, 400, 500]
# set optimizer
opt = SGD(lr=0.01, momentum=0.9)
plt.figure(figsize=(12,12))
for i in range(len(n_epochs)):
    # determine the plot number
    plot_no = 420 + (i+1)
    plt.subplot(plot_no)
    # fit model and plot learning curves
    fit_model(train_X, train_y, test_X, test_y, 256, epochs=n_epochs[i], opt=opt)
# show learning curves
plt.tight_layout()
plt.show()
Running the example creates a figure with eight line plots showing the classification accuracy on the train and test sets of models trained for different numbers of epochs.
The plots show that a small number of epochs results in a volatile learning process with higher variance in the classification accuracy, while a larger number of epochs results in convergence to a more stable model, exemplified by lower variance in classification accuracy.
Optimizers are algorithms or methods used to change the attributes of your neural network, such as its weights and learning rate, in order to reduce the loss.
How the weights or learning rates of your neural network should be changed to reduce the loss is defined by the optimizer you use. Optimization algorithms or strategies are responsible for reducing the loss and providing the most accurate results possible.
Available TensorFlow Optimizers (weight update rule):
Stochastic Gradient Descent (SGD) is a variant of gradient descent that updates the model’s parameters more frequently: the parameters are altered after the loss is computed on each individual training example.
Stochastic gradient descent (SGD) performs a parameter update for each training example $x^{(i)}$ and label $y^{(i)}$
$$ \theta = \theta - \eta \cdot \nabla_\theta J( \theta; x^{(i)}; y^{(i)}) $$Because the model parameters are updated so frequently, they have high variance, and the loss function fluctuates with varying intensity.
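A minimal NumPy sketch of this per-example update rule (illustrative only; the function and variable names are mine, and Keras applies the rule internally when you pass the SGD optimizer):
import numpy as np
# sketch of the vanilla SGD rule above: theta = theta - eta * grad of J at (x_i, y_i)
def sgd_update(theta, grad, eta=0.01):
    # one parameter update from the gradient of a single training example
    return theta - eta * grad
theta = np.zeros(3)                        # illustrative parameter vector
grad_example = np.array([0.5, -0.2, 0.1])  # hypothetical gradient for one (x_i, y_i)
theta = sgd_update(theta, grad_example)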
Advantages:
Disadvantages:
One of the disadvantages of the optimizers discussed so far is that the learning rate is constant for all parameters and for every update. This optimizer changes that: it adapts the learning rate $\eta$ for each parameter and at every time step $t$, working on the derivative (gradient) of the error function.
$$ g_{t, i} = \nabla_\theta J( \theta_{t, i} ) $$The gradient ($\nabla$, nabla) of the loss function $J$ with respect to the weight ($\theta$, theta) $i$ at a given time step $t$
$$ \theta_{t+1, i} = \theta_{t, i} - \dfrac{\eta}{\sqrt{G_{t, ii} + \epsilon}} \cdot g_{t, i} $$Update of the weight ($\theta$, theta) $i$ at time step $t$
$\eta$ is a learning rate that is modified for a given parameter $\theta_{i}$ at a given time step based on the previous gradients calculated for that parameter.
$G_{t, ii}$ stores the sum of the squares of the gradients with respect to $\theta_{i}$ up to time step $t$, while $\epsilon$ is a smoothing term that avoids division by zero (usually on the order of $10^{-8}$). Interestingly, without the square root operation, the algorithm performs much worse.
It makes big updates for infrequent parameters and small updates for frequent parameters.
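A small NumPy sketch of the AdaGrad update described above (illustrative only; the names are mine, not Keras internals):
import numpy as np
# sketch of AdaGrad: G accumulates squared gradients per parameter, so parameters
# with a history of large gradients get smaller effective learning rates
def adagrad_update(theta, grad, G, eta=0.01, eps=1e-8):
    G = G + grad ** 2                              # accumulate squared gradients (diagonal of G_t)
    theta = theta - eta / np.sqrt(G + eps) * grad  # per-parameter scaled step
    return theta, G
theta, G = np.zeros(3), np.zeros(3)
theta, G = adagrad_update(theta, np.array([0.5, -0.2, 0.1]), G)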
Advantages:
Disadvantages:
It is an extension of AdaGrad that addresses its decaying learning rate problem. Instead of accumulating all past squared gradients, Adadelta restricts the window of accumulated past gradients to some fixed size $w$; an exponentially decaying moving average is used rather than the sum of all past gradients.
$$ E[g^2]_t = \gamma E[g^2]_{t-1} + (1 - \gamma) g^2_t $$We set $\gamma$ to a value similar to the momentum term, around $0.9$. For clarity, we now rewrite the vanilla SGD update in terms of the parameter update vector $\Delta \theta_{t}$
$$ \Delta \theta_t = - \dfrac{\eta}{\sqrt{E[g^2]_t + \epsilon}} g_{t} $$Update the parameters
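A sketch of the update as written above, with the running average of squared gradients replacing AdaGrad's ever-growing sum (illustrative names only; note the full Adadelta method additionally replaces $\eta$ with a running average of past parameter updates):
import numpy as np
# sketch of the running-average update shown above
def adadelta_like_update(theta, grad, Eg2, eta=1.0, gamma=0.9, eps=1e-8):
    Eg2 = gamma * Eg2 + (1 - gamma) * grad ** 2      # E[g^2]_t
    theta = theta - eta / np.sqrt(Eg2 + eps) * grad  # delta theta_t applied to the parameters
    return theta, Eg2
theta, Eg2 = np.zeros(3), np.zeros(3)
theta, Eg2 = adadelta_like_update(theta, np.array([0.5, -0.2, 0.1]), Eg2)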
Advantages:
Disadvantages:
Adam (Adaptive Moment Estimation) works with first- and second-order moments of the gradients. The intuition behind Adam is that we don’t want to roll so fast just because we can jump over the minimum; we want to decrease the velocity a little bit for a careful search. In addition to storing an exponentially decaying average of past squared gradients like Adadelta, Adam also keeps an exponentially decaying average of past gradients $m_t$.
$m_t$ and $v_t$ are estimates of the first moment (the mean) and the second moment (the uncentered variance) of the gradients, respectively.
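For reference, $m_t$ and $v_t$ are themselves accumulated as exponentially decaying averages of the gradient and the squared gradient, with decay rates $\beta_1$ and $\beta_2$:
$$ \begin{align} \begin{split} m_t &= \beta_1 m_{t-1} + (1 - \beta_1) g_t \\ v_t &= \beta_2 v_{t-1} + (1 - \beta_2) g_t^2 \end{split} \end{align} $$Raw (biased) first and second moment estimates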
$$ \begin{align} \begin{split} \hat{m}_t &= \dfrac{m_t}{1 - \beta^t_1} \\ \hat{v}_t &= \dfrac{v_t}{1 - \beta^t_2} \end{split} \end{align} $$Bias-corrected first and second moment estimates
To update the parameter:
$$ \theta_{t+1} = \theta_{t} - \dfrac{\eta}{\sqrt{\hat{v}_t} + \epsilon} \hat{m}_t $$The default values are $0.9$ for $\beta_1$, $0.999$ for $\beta_2$, and $10^{-8}$ for $\epsilon$
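A compact NumPy sketch of one full Adam step as described above (illustrative only; not the Keras implementation, and the names are mine):
import numpy as np
# sketch of one Adam step with bias-corrected moment estimates
def adam_update(theta, grad, m, v, t, eta=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    m = beta1 * m + (1 - beta1) * grad              # first moment (mean of gradients)
    v = beta2 * v + (1 - beta2) * grad ** 2         # second moment (uncentered variance)
    m_hat = m / (1 - beta1 ** t)                    # bias corrections
    v_hat = v / (1 - beta2 ** t)
    theta = theta - eta * m_hat / (np.sqrt(v_hat) + eps)  # parameter update
    return theta, m, v
theta, m, v = np.zeros(3), np.zeros(3), np.zeros(3)
theta, m, v = adam_update(theta, np.array([0.5, -0.2, 0.1]), m, v, t=1)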
Advantages:
Disadvantages:
Comparison between various optimizers
We will use the same dataset we used previously to look at the choice of optimizer and its effect on the model’s learning.
from keras.optimizers import Adam, Adamax, Adagrad, Adadelta, RMSprop
# set optimizer and learning rates for each
opt0 = SGD(lr=0.01, momentum=0.9)
opt1 = Adam(lr=0.01)
opt2 = Adamax(lr=0.01)
opt3 = Adagrad(lr=0.01)
opt4 = Adadelta(lr=0.01)
opt5 = RMSprop(lr=0.01)
# create dictionary of optimizers to iterate through
opt_dict = {'SGD': opt0,
'Adam': opt1,
'Adamax': opt2,
'Adagrad': opt3,
'Adadelta': opt4,
'RMSprop': opt5,
}
# fit a model and plot learning curve
def fit_model_opt(train_X, train_y, test_X, test_y, n_batch, epochs, opt, opt_name='SGD'):
    # define model
    model = Sequential()
    model.add(Dense(50, input_dim=2, activation='relu', kernel_initializer='he_uniform'))
    model.add(Dense(3, activation='softmax'))
    # compile model
    model.compile(loss='categorical_crossentropy', optimizer=opt, metrics=['accuracy'])
    # fit model
    history = model.fit(train_X, train_y, validation_data=(test_X, test_y), epochs=epochs, verbose=0,
                        batch_size=n_batch)
    # evaluate the model
    scores = model.evaluate(test_X, test_y)
    accuracy = round(scores[1] * 100, 3)
    # plot learning curves
    plt.plot(history.history['val_accuracy'], label=f'{opt_name} Accuracy = {accuracy}%')
    plt.title('Optimizers', pad=-40)
    plt.legend(loc=8, ncol=6, bbox_to_anchor=(.5, -0.175))
plt.figure(figsize=(12,6))
for opt_name, opt in opt_dict.items():
    # fit model and plot learning curves
    fit_model_opt(train_X, train_y, test_X, test_y, 256, epochs=128, opt=opt, opt_name=opt_name)
# show learning curves
plt.show()
Running the example creates a figure with six line plots showing the classification accuracy on the test set for models trained with different optimizers.
The plots show that certain optimizers do not converge, and some result in a more volatile learning process with higher variance in the classification accuracy.
The learning rate is a hyperparameter that controls how much to change the model in response to the estimated error each time the model weights are updated. Choosing the learning rate is challenging as a value too small may result in a long training process that could get stuck, whereas a value too large may result in learning a sub-optimal set of weights too fast or an unstable training process.
The learning rate may be the most important hyperparameter when configuring your neural network. Therefore it is vital to know how to investigate the effects of the learning rate on model performance and to build an intuition about the dynamics of the learning rate on model behavior.
# fit a model and plot learning curve
def fit_model_lr(train_X, train_y, test_X, test_y, n_batch, epochs, opt, lr=None):
    # define model
    model = Sequential()
    model.add(Dense(50, input_dim=2, activation='relu', kernel_initializer='he_uniform'))
    model.add(Dense(3, activation='softmax'))
    # compile model
    model.compile(loss='categorical_crossentropy', optimizer=opt, metrics=['accuracy'])
    # fit model
    history = model.fit(train_X, train_y, validation_data=(test_X, test_y), epochs=epochs, verbose=0,
                        batch_size=n_batch)
    # evaluate the model
    scores = model.evaluate(test_X, test_y)
    accuracy = round(scores[1] * 100, 3)
    # plot learning curves
    plt.plot(history.history['accuracy'], label='train')
    plt.plot(history.history['val_accuracy'], label='test')
    plt.title(f'Accuracy = {accuracy}%, Learning Rate = {lr}', pad=-40)
    plt.legend()
plt.figure(figsize=(12,10))
rates = [1.0, 0.1, 0.01, 0.001, 0.0001, 0.00001, 0.000001, 0.0000001]
for i in range(len(rates)):
    # determine the plot number
    plot_no = 420 + (i+1)
    plt.subplot(plot_no)
    # fit model and plot learning curves
    fit_model_lr(train_X, train_y, test_X, test_y, 256, epochs=128, opt=Adam(lr=rates[i]), lr=rates[i])
# show learning curves
plt.tight_layout()
plt.show()
Running the example creates a single figure that contains eight line plots for the eight different evaluated learning rates. Classification accuracy on the training dataset is marked in blue, whereas accuracy on the test dataset is marked in orange.
The plots show oscillations in behavior for the too-large learning rate of $1.0$ and the inability of the model to learn anything with the too-small learning rates of $10^{-6}$ and $10^{-7}$.
We can see that the model was able to learn the problem well with the learning rates $10^{-1}$, $10^{-2}$, and $10^{-3}$, although successively more slowly as the learning rate was decreased. With the chosen model configuration, the results suggest that a moderate learning rate of $0.01$ results in good model performance on the train and test sets.
Momentum can smooth the progression of the learning algorithm, which, in turn, can accelerate the training process.
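A small sketch of SGD with momentum, showing how a velocity term accumulates past gradients and smooths successive updates (illustrative names only; this is the common formulation used by SGD-style optimizers, not the Keras source):
import numpy as np
# sketch of SGD with momentum: the velocity blends the previous step with the new gradient
def sgd_momentum_update(theta, grad, velocity, eta=0.01, momentum=0.9):
    velocity = momentum * velocity - eta * grad  # smoothed step direction
    theta = theta + velocity                     # apply the smoothed update
    return theta, velocity
theta, velocity = np.zeros(3), np.zeros(3)
theta, velocity = sgd_momentum_update(theta, np.array([0.5, -0.2, 0.1]), velocity)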
We can adapt the example from the previous section to evaluate the effect of momentum with a fixed learning rate. In this case, we will choose the learning rate of $0.01$ that in the previous section converged to a reasonable solution.
The fit_model()
function can be updated to take a “momentum” argument instead of a learning rate argument, which can be used in the configuration of the SGD class and reported on the resulting plot.
The updated version of this function is listed below.
# fit a model and plot learning curve
def fit_model_mom(train_X, train_y, test_X, test_y, n_batch, epochs, opt, mom=None):
    # define model
    model = Sequential()
    model.add(Dense(50, input_dim=2, activation='relu', kernel_initializer='he_uniform'))
    model.add(Dense(3, activation='softmax'))
    # compile model
    model.compile(loss='categorical_crossentropy', optimizer=opt, metrics=['accuracy'])
    # fit model
    history = model.fit(train_X, train_y, validation_data=(test_X, test_y), epochs=epochs, verbose=0,
                        batch_size=n_batch)
    # evaluate the model
    scores = model.evaluate(test_X, test_y)
    accuracy = round(scores[1] * 100, 3)
    # plot learning curves
    plt.plot(history.history['accuracy'], label='train')
    plt.plot(history.history['val_accuracy'], label='test')
    plt.title(f'Accuracy = {accuracy}%, Momentum = {mom}', pad=-40)
    plt.legend()
plt.figure(figsize=(12,5))
mom = [0.0, 0.5, 0.9, 0.99]
for i in range(len(mom)):
    # determine the plot number
    plot_no = 220 + (i+1)
    plt.subplot(plot_no)
    # fit model and plot learning curves, using the learning rate of 0.01 chosen above
    fit_model_mom(train_X, train_y, test_X, test_y, 256, epochs=128, opt=SGD(lr=0.01, momentum=mom[i]), mom=mom[i])
# show learning curves
plt.tight_layout()
plt.show()