Language translation is one of the most common and widely used applications of machine learning. It is often taught as an introductory problem in machine learning courses, but in practice most people use industrial platforms such as Google Translate, Baidu Fanyi and Samsung Bixby. These platforms are highly accurate and rely on models with intricate architectures in which each component serves a specific purpose, giving good generalisation performance over a number of use cases.
The main reason is that these models are computationally expensive to train and have to be trained on vast amounts of data, so they end up being offered by large corporations. We also have open source packages like nltk and CoreNLP that help in developing models that try to approach the accuracy of the tools offered by Google and Baidu. Now, this is where things get interesting: in 2016, Google's Yonghui Wu and his team published the paper on Google's Neural Machine Translation system, or GNMT, which explains the workings of the model used in Google Translate. The model is based on an Encoder-Decoder architecture: the Encoder is built from a Bidirectional LSTM layer followed by LSTM layers, and the Decoder from LSTM layers, with an Attention layer in between.
The encoder-decoder architecture for recurrent neural networks achieves state-of-the-art results on standard machine translation benchmarks and is used at the heart of industrial translation services. The model is simple, but given the large amount of data required to train it, tuning the myriad design decisions in order to get top performance on your problem can be practically intractable.
In this tutorial, we will use Keras to train an Encoder-Decoder style model with Bidirectional LSTM layers on the Tatoeba Project's German-English dataset, which is used as the basis for flashcards for language learning. It contains 152,820 pairs of English and German phrases, one pair per line, with a tab separating the two languages. Find the dataset here.
This tutorial does not discuss the specifics of preprocessing, but the steps taken to preprocess the dataset are listed below. (The git repository also contains the Jupyter notebook used to preprocess the dataset.)
- Clean the text of punctuation and pair the German phrases with their English translations.
- The file contains 150,000+ pairs. Because training on all of them would be computationally expensive, we take only the first 10,000 pairs and split them into 9,000 for training and 1,000 for testing.
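The two steps above can be sketched roughly as follows. This is not the notebook's code: `clean_line`, `load_pairs` and `split_pairs` are hypothetical helper names, and the exact cleaning rules (lowercasing, dropping accents and other non-ASCII characters, removing punctuation) are an assumption inferred from the sample output later in this post, where "heiß" appears as "hei". Whether the notebook shuffles the pairs before splitting is not stated, so this sketch does not shuffle.

```python
import string
import unicodedata


def clean_line(text):
    """Lowercase, strip accents/non-ASCII characters and remove punctuation."""
    # Decompose accented characters, then drop anything that is not ASCII.
    text = unicodedata.normalize("NFD", text)
    text = text.encode("ascii", "ignore").decode("ascii")
    # Remove punctuation, then normalise case and whitespace.
    text = text.translate(str.maketrans("", "", string.punctuation))
    return " ".join(text.lower().split())


def load_pairs(raw_text, limit=10000):
    """Parse tab-separated 'English<TAB>German' lines into cleaned pairs."""
    pairs = []
    for line in raw_text.strip().split("\n")[:limit]:
        # Keep only the first two columns in case extra columns are present.
        english, german = line.split("\t")[:2]
        pairs.append((clean_line(english), clean_line(german)))
    return pairs


def split_pairs(pairs, n_train=9000):
    """Split the pairs into a training part and a test part."""
    return pairs[:n_train], pairs[n_train:]


sample = "Let's start.\tLasst uns anfangen.\nListen a minute.\tH\u00f6r mal kurz zu."
pairs = load_pairs(sample)
train, test = split_pairs(pairs, n_train=1)
print(train)
print(test)
```

With the real file you would read its contents into `raw_text` and keep the defaults, which gives the 9,000 / 1,000 split described above.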
Tokenizing and encoding the pairs to make them training-ready:
We can use the Keras Tokenizer class to map words to integers, as needed for modelling. We will use a separate tokenizer for the English sequences and for the German sequences. The function below, named create_tokenizer(), trains a tokenizer on a list of phrases.
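# Import paths as in Keras 2.x; adjust them for your installed version.
from numpy import array
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.utils import to_categorical

# Fit a tokenizer (word -> integer mapping) on a list of phrases.
def create_tokenizer(lines):
    tokenizer = Tokenizer()
    tokenizer.fit_on_texts(lines)
    return tokenizer

# Length (in words) of the longest phrase.
def max_length(lines):
    return max(len(line.split()) for line in lines)

# Integer-encode the phrases and pad them with zeros at the end.
def encode_sequences(tokenizer, length, lines):
    X = tokenizer.texts_to_sequences(lines)
    X = pad_sequences(X, maxlen=length, padding='post')
    return X

# One-hot encode the target sequences for the softmax output layer.
def encode_output(sequences, vocab_size):
    ylist = list()
    for sequence in sequences:
        encoded = to_categorical(sequence, num_classes=vocab_size)
        ylist.append(encoded)
    y = array(ylist)
    # Shape: (number of sequences, sequence length, vocabulary size).
    y = y.reshape(sequences.shape[0], sequences.shape[1], vocab_size)
    return y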
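To make the effect of these helpers concrete, here is a small pure-Python illustration of the same idea: words are mapped to integers (most frequent first, starting at 1) and shorter phrases are padded with zeros at the end. It is a simplified stand-in, not the Keras implementation, and `build_word_index` and `encode_and_pad` are hypothetical names used only for this sketch.

```python
from collections import Counter


def build_word_index(lines):
    """Map each word to an integer, most frequent first, starting at 1."""
    counts = Counter(word for line in lines for word in line.split())
    # Index 0 is left free so that it can be used for padding.
    return {word: i for i, (word, _) in enumerate(counts.most_common(), start=1)}


def encode_and_pad(word_index, length, lines):
    """Turn phrases into integer lists and pad them with zeros at the end."""
    encoded = []
    for line in lines:
        ids = [word_index[w] for w in line.split() if w in word_index]
        encoded.append(ids + [0] * (length - len(ids)))
    return encoded


lines = ["tom was bored", "tom relented", "i lost the bet"]
word_index = build_word_index(lines)
length = max(len(line.split()) for line in lines)
vocab_size = len(word_index) + 1  # +1 for the padding index 0

print(word_index)
print(encode_and_pad(word_index, length, lines))
# [[1, 2, 3, 0], [1, 4, 0, 0], [5, 6, 7, 8]]
```

The `+ 1` on the vocabulary size matters for the model as well: the embedding and output layers need one slot for every word plus one for the padding index.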
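Each input and output sequence must be encoded to integers and padded to the maximum phrase length. The output sequences are additionally one-hot encoded, because the model predicts a probability for every word in the English vocabulary at each output position. The model below embeds the source words, encodes them with a Bidirectional LSTM, repeats the encoded vector once per output timestep, and decodes it with a second Bidirectional LSTM followed by a softmax layer over the target vocabulary: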
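from keras.models import Sequential
from keras.layers import LSTM, Dense, Embedding, RepeatVector, TimeDistributed, Bidirectional

def define_model(src_vocab, tar_vocab, src_timesteps, tar_timesteps, n_units):
    model = Sequential()
    # Encoder: embed the source words and summarise them in one vector.
    model.add(Embedding(src_vocab, n_units, input_length=src_timesteps, mask_zero=True))
    model.add(Bidirectional(LSTM(n_units)))
    # Repeat the encoded vector once for every output timestep.
    model.add(RepeatVector(tar_timesteps))
    # Decoder: produce one word distribution per output timestep.
    model.add(Bidirectional(LSTM(n_units, return_sequences=True)))
    model.add(TimeDistributed(Dense(tar_vocab, activation='softmax')))
    return model
The above model (in this tutorial) is trained for 30 epochs with a batch size of 64. After prediction, we can use the word_for_id function to map each predicted integer back to a word, and so read the English translation of the German input phrase:
from numpy import argmax

# Map an integer back to its word (None for the padding index 0).
def word_for_id(integer, tokenizer):
    for word, index in tokenizer.word_index.items():
        if index == integer:
            return word
    return None

# Translate one encoded source phrase into a string of target words.
def predict_sequence(model, tokenizer, source):
    # The batch holds a single phrase, so take the first prediction.
    prediction = model.predict(source, verbose=0)[0]
    # Pick the most probable word at every timestep.
    integers = [argmax(vector) for vector in prediction]
    target = list()
    for i in integers:
        word = word_for_id(i, tokenizer)
        if word is None:
            break
        target.append(word)
    return ' '.join(target)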
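One possible tweak, shown here only as a sketch: word_for_id scans the whole vocabulary for every predicted word, which can be replaced by a reverse dictionary built once. The example below is pure Python and uses a made-up three-word vocabulary and hand-written probability rows instead of a real tokenizer and model; `build_index_word` and `decode` are hypothetical names.

```python
def build_index_word(word_index):
    """Invert a word -> integer mapping into integer -> word."""
    return {index: word for word, index in word_index.items()}


def decode(probabilities, index_word):
    """Greedy decoding: take the most probable word at every timestep."""
    words = []
    for vector in probabilities:
        best = max(range(len(vector)), key=vector.__getitem__)
        word = index_word.get(best)
        if word is None:  # index 0 is padding, so the phrase has ended
            break
        words.append(word)
    return " ".join(words)


# Toy vocabulary and model output (one row of probabilities per timestep).
word_index = {"tom": 1, "was": 2, "bored": 3}
index_word = build_index_word(word_index)
probabilities = [
    [0.1, 0.7, 0.1, 0.1],
    [0.1, 0.1, 0.6, 0.2],
    [0.0, 0.1, 0.2, 0.7],
    [0.9, 0.1, 0.0, 0.0],
]
print(decode(probabilities, index_word))  # tom was bored
```

In the tutorial's code, the same dictionary could be built once from tokenizer.word_index and reused for every prediction.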
Evaluating the model:
We will use BLEU, the Bilingual Evaluation Understudy, which is a score for comparing a candidate translation of a text to one or more reference translations.
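To give a feel for what the score measures, here is a simplified, single-reference, sentence-level sketch of the BLEU idea: clipped n-gram precision combined with a brevity penalty. It is an illustration only and not NLTK's implementation; in particular, NLTK's corpus_bleu pools n-gram counts over the whole corpus rather than scoring each sentence separately. The name `simple_bleu` and its helpers are hypothetical.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def modified_precision(reference, candidate, n):
    """Clipped n-gram precision of the candidate against one reference."""
    cand_counts = Counter(ngrams(candidate, n))
    ref_counts = Counter(ngrams(reference, n))
    total = sum(cand_counts.values())
    if total == 0:
        return 0.0
    # A candidate n-gram only counts as often as it occurs in the reference.
    clipped = sum(min(count, ref_counts[gram]) for gram, count in cand_counts.items())
    return clipped / total


def brevity_penalty(reference, candidate):
    """Penalise candidates that are shorter than the reference."""
    if len(candidate) == 0:
        return 0.0
    if len(candidate) >= len(reference):
        return 1.0
    return math.exp(1 - len(reference) / len(candidate))


def simple_bleu(reference, candidate, weights):
    """Weighted geometric mean of n-gram precisions times the brevity penalty."""
    log_score = 0.0
    for n, weight in enumerate(weights, start=1):
        if weight == 0:
            continue
        precision = modified_precision(reference, candidate, n)
        if precision == 0:
            return 0.0
        log_score += weight * math.log(precision)
    return brevity_penalty(reference, candidate) * math.exp(log_score)


reference = "i like this dog".split()
candidate = "i like this car".split()
print(round(simple_bleu(reference, candidate, (1.0, 0, 0, 0)), 4))    # 0.75
print(round(simple_bleu(reference, candidate, (0.5, 0.5, 0, 0)), 4))  # 0.7071
```

Three of the four candidate words match the reference, giving a unigram precision of 0.75; only two of the three bigrams match, which pulls the two-gram score down.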
from nltk.translate.bleu_score import corpus_bleu

def evaluate_model(model, tokenizer, sources, raw_dataset):
    actual, predicted = list(), list()
    for i, source in enumerate(sources):
        # Reshape one encoded phrase into a batch of size 1.
        source = source.reshape((1, source.shape[0]))
        translation = predict_sequence(model, tokenizer, source)
        raw_target, raw_src = raw_dataset[i]
        # Print the first few examples for a quick sanity check.
        if i < 10:
            print('src=[%s], target=[%s], predicted=[%s]' % (raw_src, raw_target, translation))
        # corpus_bleu expects a list of references per candidate.
        actual.append([raw_target.split()])
        predicted.append(translation.split())
    # Cumulative BLEU scores for 1-grams up to 4-grams.
    print('BLEU-1: %f' % corpus_bleu(actual, predicted, weights=(1.0, 0, 0, 0)))
    print('BLEU-2: %f' % corpus_bleu(actual, predicted, weights=(0.5, 0.5, 0, 0)))
    print('BLEU-3: %f' % corpus_bleu(actual, predicted, weights=(0.3, 0.3, 0.3, 0)))
    print('BLEU-4: %f' % corpus_bleu(actual, predicted, weights=(0.25, 0.25, 0.25, 0.25)))
Following are the results obtained after training and testing the model:
Test BLEU Scores and Samples
src=[lasst uns anfangen], target=[lets start], predicted=[lets begin]
src=[tom hat nachgegeben], target=[tom relented], predicted=[tom has]
src=[hor mal kurz zu], target=[listen a minute], predicted=[sweet one]
src=[tom war gelangweilt], target=[tom was bored], predicted=[tom was bored]
src=[ich habe die wette verloren], target=[i lost the bet], predicted=[i lost the bet]
src=[du bist betrunken], target=[you are drunk], predicted=[youre drunk]
src=[das tut weh], target=[that hurts], predicted=[its hurts]
src=[bitte nimm dir einen], target=[please take one], predicted=[please take one]
src=[es ist heute hei], target=[its hot today], predicted=[its hot today]
src=[dieser hund gefallt mir], target=[i like this dog], predicted=[i like this car]
BLEU-1: 0.588955
BLEU-2: 0.474139
BLEU-3: 0.400485
BLEU-4: 0.239205
The BLEU-4 score obtained was 0.239205, which is not that good, but remember that we only trained on 10,000 examples for 30 epochs. You can always tweak the model and try out different combinations, and I would absolutely love to hear what you find. (Maybe in the comments down below! 😉)
The code for this tutorial (including the preprocessing Jupyter notebook) can be found below:
You can choose to train this model, or your own configuration, with all 150,000 phrase pairs on Kaggle using Run All and Commit, which keeps training your model without any intervention required.
Check out an example implementation of GNMT using TensorFlow below:
Read the complete paper on GNMT below:
P.S. - This blog entry is more than a week late as the blog I was working on for last week's entry is still in progress, and yeah, it's going to be a big one! So, stay tuned.