Train your first Neural Network for Image Captioning using Transfer Learning
ImageNet ran the ILSVRC challenge from 2010 to 2017, during which a wave of large-scale algorithms surfaced and pushed the state of the art in Computer Vision forward. The Deep Residual Network, or ResNet, was one of the most interesting of these models: it won the ILSVRC 2015 classification competition with a top-5 error rate of 3.57%.
What are Residual Networks?
Deep Neural Networks stack many layers, and as the depth grows, the network struggles to propagate useful gradient information from the output end of the model back to the layers near the input (the Vanishing Gradient Problem). Much like LSTM/GRU cells address this problem in RNNs with an updated architecture, the ResNet authors tackled it by building the network out of residual units or blocks that have skip connections, also called identity connections.

In a residual block, the input of the block is added to the output of the stacked layers, so the block learns F(x) and outputs F(x) + x. The hop or skip could be over 1, 2 or even 3 layers. When adding, the dimensions of x may differ from those of F(x), because the convolutions inside the block can change the spatial size or the number of channels. In that case, an additional 1x1 convolution is applied on the skip path to project x to the matching shape.
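To make the idea concrete, here is a minimal sketch of a residual block in Keras. It is an illustrative example, not the exact block used inside ResNet-50; the 1x1 convolution on the skip path is only added when the shapes would otherwise not match.
from keras.layers import Input, Conv2D, BatchNormalization, Activation, Add
from keras.models import Model

def residual_block(x, filters, stride=1):
    # Main path F(x): two 3x3 convolutions with batch norm and ReLU
    y = Conv2D(filters, 3, strides=stride, padding='same')(x)
    y = BatchNormalization()(y)
    y = Activation('relu')(y)
    y = Conv2D(filters, 3, padding='same')(y)
    y = BatchNormalization()(y)

    # Skip path: project x with a 1x1 convolution only if shapes would not match
    shortcut = x
    if stride != 1 or int(x.shape[-1]) != filters:
        shortcut = Conv2D(filters, 1, strides=stride, padding='same')(x)
        shortcut = BatchNormalization()(shortcut)

    # Identity (skip) connection: the block outputs F(x) + x
    y = Add()([y, shortcut])
    return Activation('relu')(y)

# Example: a block that doubles the channels and halves the spatial size
inp = Input(shape=(56, 56, 64))
out = residual_block(inp, filters=128, stride=2)
block = Model(inp, out)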
What is Transfer Learning?
Transfer Learning is a popular approach in deep learning in which pre-trained models are used as the starting point for computer vision and natural language processing tasks. It is attractive because of the vast compute and time resources required to train neural networks on these problems from scratch, and because of the huge jumps in skill that pre-trained models provide on related problems.
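In Keras this usually boils down to loading a model with its pre-trained ImageNet weights, dropping the classification head, and (optionally) freezing the remaining layers. A minimal sketch (the variable name feature_extractor is just illustrative):
from keras.applications.resnet50 import ResNet50

# Load ResNet-50 with ImageNet weights and without its classification head,
# so it can be reused as a fixed feature extractor
feature_extractor = ResNet50(include_top=False, weights='imagenet',
                             input_shape=(224, 224, 3), pooling='avg')

# Freeze the pre-trained weights so only the new layers we add would be trained
for layer in feature_extractor.layers:
    layer.trainable = False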
Let's get into the tutorial!
For this blog, we will train the Decoder half of a complete Encoder-Decoder setup. Put simply, the pre-trained ResNet-50 model acts as the Encoder and converts each image in the dataset into a feature vector; because it was trained on ImageNet, these features carry the visual information needed for captioning. We will then feed the encoded images together with the captions from the dataset into the Decoder Network and train it to achieve our objective of Image Captioning.
We will be training the model on Kaggle using the Flickr8k dataset.
Step 1:
Import the ResNet50 model, encode the images from the dataset, and save the encodings into a .pkl file.
from keras.applications.resnet50 import ResNet50
from keras.preprocessing import image
import numpy as np

train_path = '../input/flickr8k/145129_343604_upload_Flickr_Data/Flickr_Data/Flickr_TextData/Flickr_8k.trainImages.txt'
x_train = open(train_path, 'r').read().split("\n")
images_path = '../input/flickr8k/145129_343604_upload_Flickr_Data/Flickr_Data/Images/'

from IPython.core.display import display, HTML
display(HTML("""<a href="http://ethereon.github.io/netscope/#/gist/db945b393d40bfa26006">ResNet50 Architecture</a>"""))

model = ResNet50(include_top=False, weights='imagenet', input_shape=(224,224,3), pooling='avg')
model.summary()

# Encode the first 3000 training images into 2048-dimensional ResNet features
train_data = {}
ctr = 0
for ix in x_train:
    if ix == "":
        continue
    if ctr >= 3000:
        break
    ctr += 1
    if ctr % 1000 == 0:
        print(ctr)
    path = images_path + ix
    img = preprocessing(path)  # preprocessing() is defined in Step 4
    pred = model.predict(img).reshape(2048)
    train_data[ix] = pred
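The step heading mentions saving the encodings to a .pkl file; a minimal way to do that with the standard pickle module (the filename train_encoded_images.p is an assumption, use whichever name you prefer):
import pickle

# Persist the image-name -> 2048-d feature dictionary for the later steps
# (the filename below is just an example)
with open("train_encoded_images.p", "wb") as f:
    pickle.dump(train_data, f)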
Step 2:
Preprocess the caption text by splitting each sentence into words and storing the words as index sequences for our Decoder Network. Each caption is expanded into (partial sequence, next word) training pairs, as illustrated below.
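For intuition, here is a tiny worked example (with a made-up caption) of how one sentence expands into (partial sequence, next word) pairs:
# Hypothetical caption, already wrapped in <start>/<end> tokens
caption = "<start> a dog runs <end>".split()
for i in range(1, len(caption)):
    # Everything seen so far is the input, the following word is the target
    print(caption[:i], "->", caption[i])

# ['<start>']                     -> 'a'
# ['<start>', 'a']                -> 'dog'
# ['<start>', 'a', 'dog']         -> 'runs'
# ['<start>', 'a', 'dog', 'runs'] -> '<end>'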
from keras.preprocessing import image, sequence
import pandas as pd
import numpy as np

max_len = 40  # maximum caption length used for padding (also used in Step 3)

padded_sequences, subsequent_words = [], []

pd_dataset = pd.read_csv("flickr_8k_train_dataset.txt", delimiter='\t')
ds = pd_dataset.values
print(ds.shape)

# Storing all the captions from ds into a list
sentences = []
for ix in range(ds.shape[0]):
    sentences.append(ds[ix, 1])

# Build the vocabulary from every word that appears in the captions
words = [i.split() for i in sentences]
unique = []
for i in words:
    unique.extend(i)
unique = list(set(unique))
vocab_size = len(unique)

word_2_indices = {val: index for index, val in enumerate(unique)}
indices_2_word = {index: val for index, val in enumerate(unique)}

# Reserve index 0 for the 'UNK' token; the word that happened to land at
# index 0 in this run ('raining') is moved to a new index at the end
word_2_indices['UNK'] = 0
word_2_indices['raining'] = 8253
indices_2_word[0] = 'UNK'
indices_2_word[8253] = 'raining'

vocab_size = len(word_2_indices.keys())

# Expand every caption into (partial sequence, next word) training pairs
for ix in range(ds.shape[0]):
    partial_seqs = []
    next_words = []
    text = ds[ix, 1].split()
    text = [word_2_indices[i] for i in text]
    for i in range(1, len(text)):
        partial_seqs.append(text[:i])
        next_words.append(text[i])
    padded_partial_seqs = sequence.pad_sequences(partial_seqs, max_len, padding='post')
    next_words_1hot = np.zeros([len(next_words), vocab_size], dtype=bool)

    # Vectorization: one-hot encode the next word for each partial sequence
    for i, next_word in enumerate(next_words):
        next_words_1hot[i, next_word] = 1

    padded_sequences.append(padded_partial_seqs)
    subsequent_words.append(next_words_1hot)

padded_sequences = np.asarray(padded_sequences)
subsequent_words = np.asarray(subsequent_words)

# Sanity check: print the padded partial sequences of the first caption as words
for ix in range(len(padded_sequences[0])):
    for iy in range(max_len):
        print(indices_2_word[padded_sequences[0][ix][iy]], end=" ")
    print("\n")

# Stack the pairs for the first 2000 images and save them for training
num_of_images = 2000
captions = np.zeros([0, max_len])
next_words = np.zeros([0, vocab_size])
for ix in range(num_of_images):
    captions = np.concatenate([captions, padded_sequences[ix]])
    next_words = np.concatenate([next_words, subsequent_words[ix]])

np.save("captions.npy", captions)
np.save("next_words.npy", next_words)
Step 3:
Use the saved preprocessed image encodings and caption files to train the model. For the Decoder, we will use a single Dense layer as the image model to process the encoded images, and an LSTM-based language model to process the captions.
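The fit call at the end of this step needs three row-aligned arrays: captions and next_words (saved in Step 2) plus an images array that repeats each image's ResNet encoding once per (partial sequence, next word) pair. A minimal sketch of how it could be assembled, reusing ds, padded_sequences, train_data and num_of_images from the earlier steps, and assuming the first column of ds holds the image filename:
import numpy as np

captions = np.load("captions.npy")
next_words = np.load("next_words.npy")

# Repeat each image's 2048-d encoding once for every training pair built from its captions
# (assumes ds[ix, 0] is the image filename used as the key in train_data)
images = []
for ix in range(num_of_images):
    encoding = train_data[ds[ix, 0]]
    images.extend([encoding] * len(padded_sequences[ix]))
images = np.asarray(images)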
from keras.models import Sequential, Model
from keras.layers import Dense, RepeatVector, Embedding, LSTM, TimeDistributed, Concatenate, Activation

embedding_size = 128
max_len = 40

# Image branch: project the 2048-d ResNet encoding to the embedding size
# and repeat it for every timestep of the caption
image_model = Sequential()
image_model.add(Dense(embedding_size, input_shape=(2048,), activation='relu'))
image_model.add(RepeatVector(max_len))
image_model.summary()

# Language branch: embed the caption indices and run them through an LSTM
language_model = Sequential()
language_model.add(Embedding(input_dim=vocab_size, output_dim=embedding_size, input_length=max_len))
language_model.add(LSTM(256, return_sequences=True))
language_model.add(TimeDistributed(Dense(embedding_size)))

# Decoder: merge both branches and predict the next word over the vocabulary
conca = Concatenate()([image_model.output, language_model.output])
x = LSTM(128, return_sequences=True)(conca)
x = LSTM(512, return_sequences=False)(x)
x = Dense(vocab_size)(x)
out = Activation('softmax')(x)
model = Model(inputs=[image_model.input, language_model.input], outputs=out)

# model.load_weights("../input/model_weights.h5")
model.compile(loss='categorical_crossentropy', optimizer='RMSprop', metrics=['accuracy'])
model.summary()

hist = model.fit([images, captions], next_words, batch_size=512, epochs=200)
model.save_weights("model_weights.h5")
Step 4:
Generate a caption for a new image by encoding it with ResNet50 and feeding the encoding to the trained model, greedily picking the most probable next word at each step.
from IPython.display import Image, display

def preprocessing(img_path):
    # Load the image, resize it to ResNet's expected input and add a batch dimension
    im = image.load_img(img_path, target_size=(224, 224, 3))
    im = image.img_to_array(im)
    im = np.expand_dims(im, axis=0)
    return im

def get_encoding(model, img):
    # Encode a single image into a 2048-d ResNet feature vector
    img_arr = preprocessing(img)
    pred = model.predict(img_arr).reshape(2048)
    return pred

def predict_captions(image):
    # Greedy (argmax) decoding: keep appending the most probable next word
    # until "<end>" is produced or the caption reaches max_len words
    start_word = ["<start>"]
    while True:
        par_caps = [word_2_indices[i] for i in start_word]
        par_caps = sequence.pad_sequences([par_caps], maxlen=max_len, padding='post')
        preds = model.predict([np.array([image]), np.array(par_caps)])
        word_pred = indices_2_word[np.argmax(preds[0])]
        start_word.append(word_pred)
        if word_pred == "<end>" or len(start_word) > max_len:
            break
    return ' '.join(start_word[1:-1])

resnet = ResNet50(include_top=False, weights='imagenet', input_shape=(224, 224, 3), pooling='avg')

img = "../input/flickr8k/145129_343604_upload_Flickr_Data/Flickr_Data/Images/1072153132_53d2bb1b60.jpg"
test_img = get_encoding(resnet, img)

Argmax_Search = predict_captions(test_img)
z = Image(filename=img)
display(z)
print(Argmax_Search)

With the tutorial above, we saw how a pre-trained model lets us inherit important visual knowledge and use it to encode the images (training such an encoder from scratch would have taken far too much time and a good amount of compute resources), resulting in a model with decent Image Captioning capabilities.
[Optional]
Check out the full notebook here:

[Bonus]
Check out Hugging Face to search for pre-trained models that can be used in a similar fashion for other applications leveraging Transfer Learning.
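For example, the transformers library exposes ready-made captioning pipelines; the model name below is just one publicly available option on the Hub, so browse for alternatives that fit your task:
from transformers import pipeline

# Load a pre-trained image-captioning model from the Hugging Face Hub
captioner = pipeline("image-to-text", model="nlpconnect/vit-gpt2-image-captioning")
print(captioner("path/to/your/image.jpg"))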

Cheers!