Train your first Optimized Deep Learning Model

Artificial-Intelligence Dec 25, 2020

While perusing the Deep Learning book by Ian Goodfellow in my search for research topics, I stumbled on Optimization in Deep Learning, and it caught my interest.

TL;DR: Check out the public notebook on Kaggle.

So, the question is: What is Optimization and why is it relevant for Deep Learning?

Of all the drawbacks of traditional deep learning, the most common one is that models are hungry for data. As one can imagine, these data-hungry models also require a lot of training time, and it only grows as the number of layers in a deep learning model increases. For example, the model from my last blog took around 6 hours to train, as it had a complex architecture combining a CNN and an LSTM over both image and text data.

Consider the equation below, with cost function J, network weights W, bias(es) b, and the mean of the loss L between the predicted value y' and the actual value y over m training examples indexed by i.

Cost Function J:

J(W, b) = (1/m) * Σ_{i=1}^{m} L(y'^(i), y^(i))

The weights and biases are (generally) initialised with random values, and the training procedure's objective is to minimise the cost function. During forward propagation the current weights and biases are used to compute the cost, and through iteration, over time, the cost function is driven down to yield optimised values of the weights and biases. The algorithms that perform this iterative minimisation are where optimization and optimizers come in!
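To make the idea concrete, here is a minimal sketch (not from the notebook) of a single plain gradient-descent update; grad_W and grad_b are placeholders for the gradients of J that backpropagation would supply:

import numpy as np

def gradient_descent_step(W, b, grad_W, grad_b, learning_rate=0.01):
    # Move the parameters a small step against the gradient of the cost J
    W = W - learning_rate * grad_W
    b = b - learning_rate * grad_b
    return W, b

# Hypothetical usage: random initialisation, then repeated update steps
W = np.random.randn(32, 10) * 0.01
b = np.zeros(10)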

Some of the most common optimizer algorithms used are: Stochastic Gradient Descent (SGD), Momentum, Adagrad, Adam and RMSProp.
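All of these ship with Keras; a quick sketch of how each one is instantiated (the learning rates shown are only illustrative defaults, not tuned values):

from tensorflow import keras

sgd = keras.optimizers.SGD(learning_rate=0.01)
momentum = keras.optimizers.SGD(learning_rate=0.01, momentum=0.9)  # SGD with momentum
adagrad = keras.optimizers.Adagrad(learning_rate=0.01)
adam = keras.optimizers.Adam(learning_rate=0.001)
rmsprop = keras.optimizers.RMSprop(learning_rate=0.001)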

A GIF that visualises a loss surface and the relative time each optimizer takes to reach the optimum (here Adadelta wins)

For the scope of this tutorial (intended as an introductory blog entry), we are going to train a CNN model with the RMSProp optimizer on the CIFAR-10 dataset using Keras.

The Dataset: (straight from the publisher's website)

The CIFAR-10 dataset consists of 60000 32x32 colour images in 10 classes, with 6000 images per class. There are 50000 training images and 10000 test images.

The dataset is divided into five training batches and one test batch, each with 10000 images. The test batch contains exactly 1000 randomly-selected images from each class. The training batches contain the remaining images in random order, but some training batches may contain more images from one class than another. Between them, the training batches contain exactly 5000 images from each class.

The 10 classes are: Airplane, Automobile, Bird, Cat, Deer, Dog, Frog, Horse, Ship, and Truck.
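Keras bundles CIFAR-10 as a built-in dataset, so loading it and confirming the shapes described above takes only a few lines (a minimal sketch):

from tensorflow.keras.datasets import cifar10

(x_train, y_train), (x_test, y_test) = cifar10.load_data()
print(x_train.shape)  # (50000, 32, 32, 3) - training images, 32x32 RGB
print(x_test.shape)   # (10000, 32, 32, 3) - test images
print(y_train.shape)  # (50000, 1) - integer class labels 0..9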

The Optimizer:

RMSProp stands for Root Mean Square Propagation and was first proposed by Geoff Hinton in Lecture 6 of the online course Neural Networks for Machine Learning. It dampens the oscillations along steep directions (the y-axis in the classic ravine picture) by scaling each update by a running average of recent gradient magnitudes, which makes it much faster than plain gradient descent.

RMSProp Visualisation
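In code the rule is short: keep an exponentially decaying average of the squared gradient and divide every step by its root. A rough NumPy sketch of the update (rho and eps are the usual textbook values, not taken from the notebook):

import numpy as np

def rmsprop_step(w, grad, cache, learning_rate=0.001, rho=0.9, eps=1e-7):
    # Running average of the squared gradient for this parameter
    cache = rho * cache + (1 - rho) * grad ** 2
    # Divide by the root of that average so steep directions take smaller steps
    w = w - learning_rate * grad / (np.sqrt(cache) + eps)
    return w, cache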

Now let's have a look at the model:

from tensorflow import keras
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv2D, Activation, MaxPooling2D, Dropout, Flatten, Dense

classes = 10  # number of CIFAR-10 classes; x_train comes from the loading snippet above

model = Sequential()

#Layer 1&2 - CNN
model.add(Conv2D(128,(3,3), padding = 'same', input_shape = x_train.shape[1:]))
model.add(Activation('elu'))
model.add(Conv2D(128, (3,3)))
model.add(Activation('elu'))
model.add(MaxPooling2D(pool_size=(2,2)))
model.add(Dropout(0.25))

#Layer 3&4 - CNN
model.add(Conv2D(256, (3, 3), padding='same'))
model.add(Activation('elu'))
model.add(Conv2D(256, (3, 3)))
model.add(Activation('elu'))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Dropout(0.25))

#Layer 5&6 - CNN
model.add(Conv2D(512, (3, 3), padding='same'))
model.add(Activation('elu'))
model.add(Conv2D(512, (3, 3)))
model.add(Activation('elu'))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Dropout(0.25))

#Layer 7&8 - Fully Connected Layers
model.add(Flatten())
model.add(Dense(1024))
model.add(Activation('elu'))
model.add(Dropout(0.5))
model.add(Dense(classes))
model.add(Activation("softmax"))

# RMSProp optimizer with learning rate = 0.0001
opt = keras.optimizers.RMSprop(learning_rate=0.0001, decay=1e-6)

model.compile(loss='categorical_crossentropy', optimizer=opt, metrics=['accuracy'])
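To train the compiled model, the pixel values are scaled to [0, 1] and the labels one-hot encoded (categorical cross-entropy expects one-hot targets); roughly like this, where the batch size of 64 is illustrative and the test split doubles as validation data, as noted below:

from tensorflow.keras.utils import to_categorical

# Scale pixels to [0, 1] and one-hot encode the labels
x_train = x_train.astype('float32') / 255.0
x_test = x_test.astype('float32') / 255.0
y_train = to_categorical(y_train, classes)
y_test = to_categorical(y_test, classes)

model.fit(x_train, y_train,
          batch_size=64,
          epochs=120,
          validation_data=(x_test, y_test))

loss, acc = model.evaluate(x_test, y_test)
print('Validation accuracy:', acc)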

The model with the RMSProp optimizer trains (with GPU acceleration) at around 16 seconds per epoch, compared to roughly 18 seconds per epoch for the same model without the RMSProp optimizer (this can be checked in the Saved Versions of the notebook linked above).

Upon completion, after training for 120 epochs, the model reaches an accuracy of 85.89% on the testing set! (For the scope of this tutorial, we have not set aside a separate test set and have used the validation set as the benchmark for accuracy. Refer to this for more information!)

[Bonus]

If you are interested in the basics of optimization for deep learning, with elaborate descriptions of the optimization techniques mentioned above, you can refer to the link below:

https://www.deeplearningbook.org/contents/optimization.html

Merry Christmas!

Cheers!

