Train your first Multimodal Learning Model

Artificial-Intelligence Dec 16, 2020

In the few months since I published my first research paper on Deep Learning, I have been following ongoing advancements, looking for an area of AI where I could make a significant contribution sometime in the future. Along the way, I stumbled upon SoundNet, which I found extremely interesting. The general concept that SoundNet is based on is called Multimodal Learning.

TL;DR - Check out the public notebook.

So, what is Multimodal Learning? (I'll give you the layman's version. For deeper insight, you can refer to this awesome seminar by Victoria Dean!)

Suppose you are helping your friend X pick out Biryani at a restaurant. The moment X asks for Biryani, an image of Biryani pops into your head and you are able to point it out in the restaurant's food display. (For this scenario, we are at a restaurant buffet where the dishes are not labelled, and X does not know what Biryani looks or tastes like! - X might be an alien using a human body as a Trojan horse.)

Biryani - Photo by Atikah Akhtar / Unsplash

You can do so because you have heard what Biryani (the word) sounds like and means, and you have seen what Biryani looks like. You are able to "fuse" these two pieces of information into the correct perception of what needs to be done, and can therefore successfully point out the Biryani at the restaurant.

The two inputs, or "modes", are Audio (hearing the term 'Biryani') and Visual (identifying Biryani by seeing it). Multimodal learning builds models that learn a joint representation of different modalities (in this case the audio and visual contexts, which also happen to be my point of interest in Multimodal Learning).

Multimodal Learning models combine different modalities and are trained in the following steps:

1) Train a model for one modality.

2) Train another model for the other modality (or modalities).

3) Perform Decision Fusion with the models trained above.
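
To make step 3 a little more concrete, here is a minimal sketch of one simple form of decision fusion: a weighted average of the class probabilities produced by two separately trained models. The function name, the class list and the scores are all made up for illustration. The model built later in this post takes a different route and fuses learned features by concatenation instead.

```python
def fuse_decisions(first_probs, second_probs, first_weight=0.5):
    """Weighted average of per-class probabilities from two modality models."""
    if len(first_probs) != len(second_probs):
        raise ValueError("both models must score the same classes")
    second_weight = 1.0 - first_weight
    return [first_weight * a + second_weight * b
            for a, b in zip(first_probs, second_probs)]


# Hypothetical scores for the classes ["Biryani", "Pasta", "Salad"].
visual_probs = [0.70, 0.20, 0.10]  # what the dish looks like
audio_probs = [0.90, 0.05, 0.05]   # what the spoken word sounded like

fused = fuse_decisions(visual_probs, audio_probs)
best = max(range(len(fused)), key=fused.__getitem__)  # index of the top class
print(fused, best)
```

With equal weights, both modalities agree on the first class, so the fused decision points at Biryani.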

Now, let's get on to the fun stuff!

For this tutorial, we are going to train a Multimodal Learning model on the Image and Text modalities using the Movies Dataset, which I have already pre-processed and made available on Kaggle for direct use. As this tutorial is meant to be a short introduction to Multimodal Learning, I'll be using Keras to write the model.

Dataset & Preprocessing

The original dataset consists of movie overviews, their poster URLs on the TMDB website, and the genres they fall under. The objective of this model is to correctly predict the genres of the movies in the test set. I also took the liberty of dividing the preprocessed data into training, validation and testing sets.

Highlights from the (Original) Dataset:

1) Movie genres and the number of titles in each genre - Action: 6596, Adventure: 3496, Animation: 1935, Comedy: 13182, Crime: 4307, Documentary: 3932, Drama: 20265, Family: 2770, Fantasy: 2313, History: 1398, Horror: 4673, Music: 1598, Mystery: 2467, Romance: 6735, Science Fiction: 3049, Thriller: 7624, War: 1323, Western: 1042

2) The movies in the dataset were released before July 2017.

3) Data points include cast, crew, plot keywords, budget, revenue, posters, release dates, languages, production companies, countries, TMDB vote counts and vote averages.

The dataset is preprocessed by downloading and resizing the poster images, removing punctuation from the overviews, and one-hot encoding the labels. Additionally, I use a Word2Vec model, which captures word-to-word relations and contexts, to build the embedding matrix for the movie overviews in the dataset.
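
As a rough illustration of the text and label steps described above, here is a small sketch using only the standard library. The helper names and the shortened genre list are made up for this example, and the actual notebook may do this differently. Since a movie can belong to several genres, the label vector has one 0/1 slot per genre.

```python
import string

# Shortened, illustrative genre list; the real dataset has 18 genres.
GENRES = ["Action", "Comedy", "Drama", "Horror"]


def clean_overview(text):
    """Remove punctuation and collapse repeated whitespace."""
    table = str.maketrans("", "", string.punctuation)
    return " ".join(text.translate(table).split())


def encode_genres(movie_genres, all_genres=GENRES):
    """Return a 0/1 vector with a 1 for every genre the movie belongs to."""
    wanted = set(movie_genres)
    return [1 if genre in wanted else 0 for genre in all_genres]


cleaned = clean_overview("A cop, a robot... and one very bad day!")
labels = encode_genres(["Drama", "Action"])
print(cleaned)
print(labels)
```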

import numpy as np
from gensim.models import KeyedVectors

# Relies on two globals defined elsewhere in the notebook:
#   EMBEDDING_FILE - path to a GloVe/fastText text file
#   word_index     - mapping from word to integer index
def get_embedding_matrix(typeToLoad):
  if typeToLoad == "glove":
    embed_size = 100
  elif typeToLoad == "word2vec":
    word2vecDict = KeyedVectors.load_word2vec_format("GoogleNews-vectors-negative300.bin", binary=True)
    embed_size = 300
  elif typeToLoad == "fasttext":
    embed_size = 300
  else:
    raise ValueError("Unknown embedding type: " + str(typeToLoad))

  embeddings_index = dict()
  if typeToLoad == "glove" or typeToLoad == "fasttext":
    # Text format: each line holds a word followed by its coefficients
    with open(EMBEDDING_FILE, encoding="utf-8") as f:
      for line in f:
        values = line.split()
        word = values[0]
        coefs = np.asarray(values[1:], dtype='float32')
        embeddings_index[word] = coefs
  else:
    # gensim 3.x API; on gensim 4+ iterate over word2vecDict.key_to_index instead
    for word in word2vecDict.vocab:
      embeddings_index[word] = word2vecDict[word]
  print("Loaded " + str(len(embeddings_index)) + " word vectors.")

  # Start from random vectors so words without a pre-trained vector still get a row
  embedding_matrix = np.random.randn(len(word_index)+1, embed_size)

  embeddedCount = 0
  for word, i in word_index.items():
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None:
      embedding_matrix[i] = embedding_vector
      embeddedCount += 1  # count the words found in the pre-trained vocabulary
  print("total embedded:", embeddedCount, "common words")
  return embedding_matrix
Embedding Matrix Function used
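
To see what the lookup loop in the function above does, here is a toy version in plain Python with a made-up three-word vocabulary and two-dimensional "pre-trained" vectors. It assumes that `word_index` starts at 1 (as it does with Keras' Tokenizer, for example), which is why the matrix gets one extra row at index 0.

```python
import random


def build_embedding_rows(word_index, pretrained, embed_size, seed=0):
    """Toy counterpart of get_embedding_matrix using plain lists."""
    rng = random.Random(seed)
    # Every row starts random; row 0 is the extra row for padding.
    rows = [[rng.gauss(0.0, 1.0) for _ in range(embed_size)]
            for _ in range(len(word_index) + 1)]
    hits = 0
    for word, i in word_index.items():
        vector = pretrained.get(word)
        if vector is not None:
            rows[i] = list(vector)  # overwrite with the pre-trained vector
            hits += 1
    return rows, hits


# Hypothetical vocabulary and vectors, purely for illustration.
word_index = {"space": 1, "heist": 2, "zorblax": 3}
pretrained = {"space": [0.1, 0.2], "heist": [0.3, 0.4]}

rows, hits = build_embedding_rows(word_index, pretrained, embed_size=2)
print(len(rows), hits)
```

The unknown word "zorblax" keeps its random row, while the two known words receive their pre-trained vectors.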

Steps & Architecture:

1) The first model is a CNN that analyses the Movie Posters (resized images).

2) The second model is an LSTM that analyses the Movie Overviews (text).

3) The outputs of the two models are concatenated and passed through two dense layers to produce the final output.

Model Architecture
# Imports assume the tf.keras flavour of Keras
from tensorflow.keras.initializers import Constant
from tensorflow.keras.layers import (Input, Embedding, Dropout, LSTM, Dense,
                                     Conv2D, MaxPooling2D, Flatten, concatenate)
from tensorflow.keras.models import Model
from tensorflow.keras.optimizers import Adam

# Relies on globals from the notebook: MAX_SEQUENCE_LENGTH, X_img_train, train_labels
def compile_model(embedding_matrix):
  # Text branch: frozen pre-trained embeddings followed by two stacked LSTMs
  lstm_input = Input(shape=(MAX_SEQUENCE_LENGTH,))
  x = Embedding(embedding_matrix.shape[0], embedding_matrix.shape[1],
                embeddings_initializer=Constant(embedding_matrix),  # actually load the pre-trained vectors
                mask_zero=True, trainable=False)(lstm_input)
  x = Dropout(0.3)(x)
  x = LSTM(64, return_sequences=True)(x)
  x = Dropout(0.3)(x)
  x = LSTM(64)(x)
  x = Dropout(0.3)(x)
  lstm_out = Dense(18, activation='relu')(x)

  # Image branch: four Conv2D + MaxPooling2D blocks over the resized posters
  img_shape = X_img_train.shape[1:]
  print(*img_shape)
  cnn_input = Input(shape=img_shape)
  y = Conv2D(32, (3, 3), activation='relu')(cnn_input)
  y = MaxPooling2D(2, 2)(y)
  y = Conv2D(64, (3, 3), activation='relu')(y)
  y = MaxPooling2D(2, 2)(y)
  y = Conv2D(128, (3, 3), activation='relu')(y)
  y = MaxPooling2D(2, 2)(y)
  y = Conv2D(128, (3, 3), activation='relu')(y)
  y = MaxPooling2D(2, 2)(y)
  y = Flatten()(y)
  y = Dropout(0.3)(y)
  cnn_out = Dense(512, activation='relu')(y)

  # Fusion: concatenate both branches, then two dense layers and a sigmoid per genre
  concat_inp = concatenate([cnn_out, lstm_out])
  z = Dense(256, activation='relu')(concat_inp)
  z = Dropout(0.3)(z)
  z = Dense(128, activation='relu')(z)
  z = Dropout(0.3)(z)
  output = Dense(train_labels.shape[1], activation='sigmoid')(z)

  model = Model(inputs=[cnn_input, lstm_input], outputs=[output])
  # `decay` is kept from the original; newer Keras releases may reject this argument
  adam = Adam(learning_rate=0.001, decay=1e-5)
  model.compile(loss='binary_crossentropy', optimizer=adam, metrics=['accuracy'])
  return model
Model's Code Snippet
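
To get a feel for the tensor sizes in the CNN branch above, the following sketch does the shape arithmetic by hand. It assumes Keras' default 'valid' padding and stride 1 for `Conv2D`, and a hypothetical 128x128 poster size; the post does not state the actual size the posters were resized to.

```python
def conv_valid(size, kernel=3):
    """Output size of a stride-1 convolution with 'valid' padding."""
    return size - kernel + 1


def max_pool(size, pool=2):
    """Output size of non-overlapping max pooling with 'valid' padding."""
    return size // pool


def cnn_branch_shapes(height, width, filters=(32, 64, 128, 128)):
    """Shape after each Conv2D + MaxPooling2D block of the image branch."""
    shapes = []
    for n_filters in filters:
        height, width = conv_valid(height), conv_valid(width)
        height, width = max_pool(height), max_pool(width)
        shapes.append((height, width, n_filters))
    return shapes


shapes = cnn_branch_shapes(128, 128)  # 128x128 is an assumed input size
h, w, c = shapes[-1]
flattened = h * w * c        # input width of the Dense(512) layer
fused_width = 512 + 18       # cnn_out and lstm_out after concatenation
print(shapes)
print(flattened, fused_width)  # 4608 530
```

Whatever the poster size, the concatenated vector fed to the fusion layers is 530 wide, because both branches end in fixed-size dense layers.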


We are going to train this model on Kaggle (with GPU acceleration, since the CNN part trains faster on GPUs), as it will not use your computer's resources and you can schedule the model to train unattended for 24 hours. (And yes, just like almost everything I prefer, it can be used for free!)


After training the model and running predictions, it was time to evaluate its performance, which is usually done with metrics such as Hamming Loss, ROC-AUC Score, F1-Score, Precision, Recall and Mean Accuracy (for multi-label classification). Here is how the model performed on the 18 genres (classes):

Hamming Loss: 0.10267838718996326

ROC-AUC Score: 0.7663174542812823

F1-Score: 0.334927670277880

Precision: 0.3444566707014232

Recall: 0.4591912230770563

Mean Accuracy: 0.7926347791396695
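
For readers who have not met these metrics before, here is a small from-scratch sketch of Hamming loss and macro-averaged precision, recall and F1 on toy data. The 0.5 threshold and the macro averaging are assumptions made for this example; the post does not state which threshold or averaging mode produced the numbers above.

```python
def binarize(probs, threshold=0.5):
    """Turn sigmoid outputs into 0/1 predictions per label."""
    return [[1 if p >= threshold else 0 for p in row] for row in probs]


def hamming_loss(y_true, y_pred):
    """Fraction of individual label slots that are predicted wrongly."""
    total = sum(len(row) for row in y_true)
    wrong = sum(t != p for t_row, p_row in zip(y_true, y_pred)
                for t, p in zip(t_row, p_row))
    return wrong / total


def macro_prf(y_true, y_pred):
    """Precision, recall and F1 computed per label, then averaged."""
    n_labels = len(y_true[0])
    precisions, recalls, f1s = [], [], []
    for j in range(n_labels):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t[j] == 1 and p[j] == 1)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t[j] == 0 and p[j] == 1)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t[j] == 1 and p[j] == 0)
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        precisions.append(prec)
        recalls.append(rec)
        f1s.append(f1)
    return (sum(precisions) / n_labels, sum(recalls) / n_labels,
            sum(f1s) / n_labels)


# Toy data: 4 movies, 3 genres (made up for illustration).
y_true = [[1, 0, 1], [0, 1, 0], [1, 1, 0], [0, 0, 1]]
probs = [[0.9, 0.2, 0.4], [0.1, 0.8, 0.3], [0.7, 0.4, 0.6], [0.2, 0.1, 0.9]]

y_pred = binarize(probs)
loss = hamming_loss(y_true, y_pred)
precision, recall, f1 = macro_prf(y_true, y_pred)
print(loss, precision, recall, f1)
```

Note that with macro averaging the F1 score is the mean of the per-label F1 values, so it is generally not the harmonic mean of the averaged precision and recall.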