Train your first Neural Network for Image Captioning using Transfer Learning

Artificial-Intelligence Feb 21, 2021

The ImageNet Large Scale Visual Recognition Challenge (ILSVRC) ran from 2010 to 2017, and many of the algorithms that advanced computer vision during that period emerged from it. The Deep Residual Network (ResNet) was one of the most interesting of these models: it won the ILSVRC 2015 classification competition with a top-5 error rate of 3.57%.

What are Residual Networks?

Deep neural networks stack many layers, and as the number of layers grows, the network struggles to propagate useful gradient information from the output end of the model back to the layers near the input (the vanishing gradient problem). Just as recurrent networks address this problem with updated architectures such as LSTM and GRU, the ResNet authors addressed it by building the network from residual units, or blocks, that have skip connections, also called identity connections.
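
To make the vanishing gradient effect concrete, here is a minimal sketch in plain Python. It is a toy model, not how ResNet is implemented: each layer is reduced to a single scalar derivative (the value 0.01 and the depth of 30 are arbitrary assumptions), and the gradient that reaches the input is the product of the per-layer derivatives. With a skip connection a layer computes x + F(x), so its derivative becomes 1 + F'(x) and the product no longer collapses towards zero.

```python
def backprop_gradient(layer_derivs, skip=False):
    """Gradient reaching the input: the product of per-layer derivatives."""
    grad = 1.0
    for d in layer_derivs:
        # Plain layer: dF/dx = d. Residual layer: d(x + F(x))/dx = 1 + d.
        grad *= (1.0 + d) if skip else d
    return grad


depth = 30               # assumed number of layers
derivs = [0.01] * depth  # assumed (small) derivative of every layer

plain = backprop_gradient(derivs)
residual = backprop_gradient(derivs, skip=True)

print(f"plain network:    {plain:.3e}")
print(f"residual network: {residual:.3e}")
```

In the plain stack the gradient is numerically negligible by the time it reaches the first layer, while the residual version stays close to 1.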

ResNet Block

In a residual block, the input x of the block is added to the output F(x) of the layers it skips. The skip can span 1, 2 or even 3 layers. When adding, the dimensions of x may differ from those of F(x), because the convolutions can reduce the dimensions. In that case, an additional 1×1 convolution layer is applied to x to change its dimensions so that the two can be added.
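
The addition described above can be sketched in a few lines of plain Python. This is only an illustration under simplifying assumptions: vectors are Python lists, small matrices stand in for the convolution layers, and the names `residual_block` and `W_proj` are hypothetical. The optional `W_proj` plays the role of the extra convolution that reshapes x when its size differs from that of F(x).

```python
def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def relu(v):
    return [max(0.0, a) for a in v]


def residual_block(x, W1, W2, W_proj=None):
    """Compute relu(F(x) + shortcut(x)) with F(x) = W2 * relu(W1 * x)."""
    fx = matvec(W2, relu(matvec(W1, x)))
    # Identity shortcut when sizes match, otherwise project x to F(x)'s size
    shortcut = x if W_proj is None else matvec(W_proj, x)
    if len(shortcut) != len(fx):
        raise ValueError("shortcut and F(x) must have the same length")
    return relu([a + b for a, b in zip(fx, shortcut)])


x = [1.0, -2.0, 3.0]

# Same size: identity shortcut. With F(x) = 0 the block simply outputs relu(x).
zeros3 = [[0.0] * 3 for _ in range(3)]
same = residual_block(x, zeros3, zeros3)

# F(x) has 2 outputs, so x (length 3) is projected down to length 2.
W1 = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
W2 = [[1.0, 0.0], [0.0, 1.0]]
W_proj = [[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]
projected = residual_block(x, W1, W2, W_proj)

print(same, projected)
```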

What is Transfer Learning?

Transfer learning is a popular approach in deep learning in which a pre-trained model is used as the starting point for a new task. It is common in computer vision and natural language processing, both because developing neural networks for these problems from scratch requires vast compute and time, and because pre-trained models give large jumps in performance on related problems.
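
As a rough illustration of the idea, the toy sketch below keeps a "pre-trained" feature extractor frozen and trains only a small head on top of it. Everything here is an assumption made for the example: `pretrained_features` is a hand-written stand-in for a real pre-trained network, and the data, learning rate and epoch count are arbitrary.

```python
def pretrained_features(x):
    """Stand-in for a frozen pre-trained encoder: it is never updated."""
    return [x, x * x]


def predict(w, b, x):
    f = pretrained_features(x)
    return w[0] * f[0] + w[1] * f[1] + b


def train_head(data, lr=0.05, epochs=500):
    """Fit only the linear head (w, b) on top of the frozen features."""
    w, b = [0.0, 0.0], 0.0
    for _ in range(epochs):
        for x, y in data:
            f = pretrained_features(x)
            err = predict(w, b, x) - y
            # Gradient step on the squared error, head parameters only
            w = [wi - lr * err * fi for wi, fi in zip(w, f)]
            b -= lr * err
    return w, b


def mse(data, w, b):
    return sum((predict(w, b, x) - y) ** 2 for x, y in data) / len(data)


# Toy task: y = 2x^2 - 1, which the frozen features can represent
data = [(x / 4, 2 * (x / 4) ** 2 - 1) for x in range(-4, 5)]

loss_before = mse(data, [0.0, 0.0], 0.0)
w, b = train_head(data)
loss_after = mse(data, w, b)

print(loss_before, loss_after)
```

Only three numbers are trained here, which mirrors why reusing a large pre-trained encoder is so much cheaper than training it from scratch.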

Let's get into the tutorial!

In this blog, we will train a neural network that serves as the Decoder of a complete Encoder-Decoder model. Put simply, the pre-trained ResNet-50 model acts as the Encoder and encodes the images in the dataset. Because it was trained on ImageNet, its image features carry information that helps with captioning. We then feed the encoded images and the captions from the dataset to the Decoder network and train it to achieve our objective of image captioning.

We will train the model on Kaggle using the Flickr8k dataset.

Step 1:

Import the ResNet50 model, encode the images from the dataset, and save the encodings into a .pkl file.

train_path = '../input/flickr8k/145129_343604_upload_Flickr_Data/Flickr_Data/Flickr_TextData/Flickr_8k.trainImages.txt'
x_train = open(train_path, 'r').read().split("\n")
images_path = '../input/flickr8k/145129_343604_upload_Flickr_Data/Flickr_Data/Images/'

from IPython.core.display import display, HTML
display(HTML("""<a href="">ResNet50 Architecture</a>"""))
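from keras.applications.resnet50 import ResNet50

# No classification head; global average pooling gives one 2048-d vector per image
model = ResNet50(include_top=False,weights='imagenet',input_shape=(224,224,3),pooling='avg')

train_data = {}
ctr = 0
for ix in x_train:
    if ix == "":
        continue  # skip blank lines in the file list
    if ctr >= 3000:
        break  # encode only the first 3000 images
    ctr += 1
    if ctr%1000==0:
        print(ctr)  # progress indicator
    path = images_path + ix
    img = preprocessing(path)  # helper defined in Step 4
    pred = model.predict(img).reshape(2048)
    train_data[ix] = pred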
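
The save to a .pkl file mentioned at the start of this step is not part of the listing. A minimal sketch with Python's `pickle` module could look like the following. The file name `encoded_images.pkl`, the temporary directory and the short placeholder vectors are assumptions for the demo; in the tutorial each value would be the 2048-d vector returned by `model.predict`.

```python
import os
import pickle
import tempfile

# Placeholder encodings standing in for the 2048-d ResNet50 vectors
train_data = {
    "image_a.jpg": [0.1, 0.2, 0.3],
    "image_b.jpg": [0.4, 0.5, 0.6],
}

# Hypothetical file name, written to a fresh temporary directory
pkl_path = os.path.join(tempfile.mkdtemp(), "encoded_images.pkl")

# Write the dictionary of encodings to disk
with open(pkl_path, "wb") as f:
    pickle.dump(train_data, f)

# Read it back, for example in a later training session
with open(pkl_path, "rb") as f:
    restored = pickle.load(f)

print(sorted(restored))
```

Saving the encodings once means the expensive ResNet50 pass does not have to be repeated every time the Decoder is trained.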

Step 2:

Preprocess the caption text by splitting each sentence into words and storing the words as index sequences for our Decoder network.

import numpy as np
import pandas as pd
from keras.preprocessing import image, sequence

padded_sequences, subsequent_words = [], []
pd_dataset = pd.read_csv("flickr_8k_train_dataset.txt", delimiter='\t')
ds = pd_dataset.values
# Storing all the captions from ds into a list
sentences = []
for ix in range(ds.shape[0]):
    sentences.append(ds[ix, 1])
words = [i.split() for i in sentences]

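# Collect every word from every caption, then de-duplicate to build the vocabulary
unique = []
for i in words:
    unique.extend(i)
unique = list(set(unique))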

vocab_size = len(unique)

word_2_indices = {val:index for index, val in enumerate(unique)}
indices_2_word = {index:val for index, val in enumerate(unique)}

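# Reserve index 0 for 'UNK' (0 is also the padding value). The word that was at
# index 0 appears to be re-added at the end of the vocabulary; the word
# 'raining' and the index 8253 are specific to the author's run.
word_2_indices['UNK'] = 0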
word_2_indices['raining'] = 8253

indices_2_word[0] = 'UNK'
indices_2_word[8253] = 'raining'

vocab_size = len(word_2_indices.keys())

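max_len = 40  # same value as in Step 3; pad_sequences truncates longer sequences

for ix in range(ds.shape[0]):
    partial_seqs = []
    next_words = []
    text = ds[ix, 1].split()
    text = [word_2_indices[i] for i in text]
    # Each prefix of the caption is an input; the word that follows it is the target
    for i in range(1, len(text)):
        partial_seqs.append(text[:i])
        next_words.append(text[i])
    padded_partial_seqs = sequence.pad_sequences(partial_seqs, max_len, padding='post')

    # One-hot encode the target words
    next_words_1hot = np.zeros([len(next_words), vocab_size], dtype=bool)
    for i,next_word in enumerate(next_words):
        next_words_1hot[i, next_word] = 1
    padded_sequences.append(padded_partial_seqs)
    subsequent_words.append(next_words_1hot)

# Captions yield different numbers of partial sequences, so store object arrays
padded_sequences = np.asarray(padded_sequences, dtype=object)
subsequent_words = np.asarray(subsequent_words, dtype=object)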

# Print the partial sequences of the first caption as words (padding index 0 shows as 'UNK')
for ix in range(len(padded_sequences[0])):
    for iy in range(max_len):
        print(indices_2_word[padded_sequences[0][ix][iy]], end=' ')
    print()

num_of_images = 2000
captions = np.zeros([0, max_len])
next_words = np.zeros([0, vocab_size])

for ix in range(num_of_images):
    captions = np.concatenate([captions, padded_sequences[ix]])
    next_words = np.concatenate([next_words, subsequent_words[ix]])

np.save("captions.npy", captions)
np.save("next_words.npy", next_words)
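
To see what this step produces for a single caption, here is a small self-contained sketch in plain Python. The tiny vocabulary and the helper names `pad_post` and `make_training_pairs` are hypothetical; `pad_post` imitates `pad_sequences(..., padding='post')` with its default behaviour of keeping the last `max_len` items when a sequence is too long.

```python
def pad_post(seq, max_len, pad_value=0):
    """Keep the last max_len items, then right-pad with pad_value."""
    kept = seq[-max_len:]
    return kept + [pad_value] * (max_len - len(kept))


def make_training_pairs(caption, word_2_indices, max_len):
    """Return (padded prefix, next word index) pairs for one caption."""
    ids = [word_2_indices[w] for w in caption.split()]
    return [(pad_post(ids[:i], max_len), ids[i]) for i in range(1, len(ids))]


# Tiny hypothetical vocabulary; index 0 is reserved for padding / 'UNK'
word_2_indices = {"UNK": 0, "<start>": 1, "a": 2, "dog": 3, "runs": 4, "<end>": 5}

pairs = make_training_pairs("<start> a dog runs <end>", word_2_indices, max_len=6)
for prefix, target in pairs:
    print(prefix, "->", target)
```

A caption with five tokens yields four training pairs, each asking the Decoder to predict the next word from everything that came before it.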

Step 3:

Use the saved preprocessed image and caption files to train the model. For the Decoder, we use a single Dense layer as the image model to process the encoded images, and an LSTM as the language model to process the captions.

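from keras.models import Sequential, Model
from keras.layers import Dense, Embedding, LSTM, RepeatVector, Concatenate, Activation

embedding_size = 128
max_len = 40

# Image model: project the 2048-d ResNet50 encoding to the embedding size
image_model = Sequential()
image_model.add(Dense(embedding_size, input_shape=(2048,), activation='relu'))
# Repeat the image vector max_len times so that it can be concatenated with
# the per-time-step outputs of the language model
image_model.add(RepeatVector(max_len))

# Language model: embed the partial caption and run it through an LSTM
language_model = Sequential()
language_model.add(Embedding(input_dim=vocab_size, output_dim=embedding_size, input_length=max_len))
language_model.add(LSTM(256, return_sequences=True))

# Merge both models and predict the next word over the whole vocabulary
conca = Concatenate()([image_model.output, language_model.output])
x = LSTM(128, return_sequences=True)(conca)
x = LSTM(512, return_sequences=False)(x)
x = Dense(vocab_size)(x)
out = Activation('softmax')(x)
model = Model(inputs=[image_model.input, language_model.input], outputs = out)

# model.load_weights("../input/model_weights.h5")
model.compile(loss='categorical_crossentropy', optimizer='RMSprop', metrics=['accuracy'])

# images must hold one encoding per row of captions and next_words
hist = model.fit([images, captions], next_words, batch_size=512, epochs=200)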
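
The training call needs the image array, the caption array and the next-word array to have the same number of rows, but this post does not show how the image array is built. One way to do it, offered here purely as an assumption, is to repeat each image encoding once for every partial sequence generated from its caption. The helper name `repeat_encodings` and the short placeholder vectors are hypothetical.

```python
def repeat_encodings(encodings, pairs_per_caption):
    """Repeat each image encoding once for every partial sequence of its caption."""
    images = []
    for enc, n_pairs in zip(encodings, pairs_per_caption):
        images.extend([enc] * n_pairs)
    return images


# Placeholder 3-d encodings stand in for the 2048-d ResNet50 vectors
encodings = [[0.1, 0.2, 0.3], [0.7, 0.8, 0.9]]
# A caption with k tokens yields k - 1 (prefix, next word) pairs
pairs_per_caption = [4, 2]

images = repeat_encodings(encodings, pairs_per_caption)
print(len(images))
```

With this alignment, row i of the image array, row i of the captions and row i of the next words all describe the same training example.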


Step 4:

Predict a caption by encoding an image with ResNet50 and passing the encoding to the trained model.

from IPython.display import Image, display

def preprocessing(img_path):
    # Load the image at ResNet50's input size and add a batch dimension
    im = image.load_img(img_path, target_size=(224,224))
    im = image.img_to_array(im)
    im = np.expand_dims(im, axis=0)
    return im

def get_encoding(model, img):
    # Encode one image file into a 2048-d feature vector
    img_array = preprocessing(img)
    pred = model.predict(img_array).reshape(2048)
    return pred

def predict_captions(image):
    # Greedy (argmax) decoding: keep appending the most likely next word
    start_word = ["<start>"]
    while True:
        par_caps = [word_2_indices[i] for i in start_word]
        par_caps = sequence.pad_sequences([par_caps], maxlen=max_len, padding='post')
        preds = model.predict([np.array([image]), np.array(par_caps)])
        word_pred = indices_2_word[np.argmax(preds[0])]
        start_word.append(word_pred)
        if word_pred == "<end>" or len(start_word) > max_len:
            break
    # Drop <start> and the final token
    return ' '.join(start_word[1:-1])

resnet = ResNet50(include_top=False,weights='imagenet',input_shape=(224,224,3),pooling='avg')

img = "../input/flickr8k/145129_343604_upload_Flickr_Data/Flickr_Data/Images/1072153132_53d2bb1b60.jpg"

test_img = get_encoding(resnet, img)

Argmax_Search = predict_captions(test_img)

# Show the image together with the predicted caption
z = Image(filename=img)
display(z)
print(Argmax_Search)

Prediction made by the model
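
The decoding loop in `predict_captions` is a greedy search: at every step it keeps only the single most likely next word. The sketch below isolates that loop so it can be run without a trained model. The lookup table `transitions` and the function `toy_next_word` are hypothetical stand-ins for the model's argmax prediction.

```python
def greedy_decode(next_word, max_len, start="<start>", end="<end>"):
    """Append the predicted next word until <end> appears or max_len is exceeded."""
    words = [start]
    while True:
        word = next_word(words)
        words.append(word)
        if word == end or len(words) > max_len:
            break
    # Drop <start> and the final token, as predict_captions does
    return " ".join(words[1:-1])


# Hypothetical stand-in for the trained model: a fixed lookup on the last word
transitions = {"<start>": "a", "a": "dog", "dog": "runs", "runs": "<end>"}


def toy_next_word(words):
    return transitions[words[-1]]


caption = greedy_decode(toy_next_word, max_len=40)
print(caption)
```

Note that when the loop stops because of the length limit rather than on `<end>`, the last generated word is dropped along with `<start>`.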

In this tutorial, we saw how a pre-trained model can supply the important information needed to encode the images. Training an image encoder from scratch would have taken far more time and compute resources; reusing ResNet50 instead gave us a model with decent image captioning capabilities.


Check out the full notebook here:

Image Captioning using ResNet


Check out Hugging Face to search for pre-trained models that can be used in a similar fashion for other applications of transfer learning.

Hugging Face – On a mission to solve NLP, one commit at a time.