Train your first Neural Network for Optical Character Recognition

Deep-Learning Apr 16, 2021

Back in the 1980s, the National Institute of Standards and Technology (NIST) started collecting a large database of handwritten digits by sending out forms to be filled out by Census Bureau employees (training dataset) and American high school students (testing dataset). Because the training and testing data came from two different groups of writers, the original NIST dataset was not well-suited for machine learning experiments. The popular dataset used for "Hello World" type Deep Learning tutorials is MNIST, the Modified National Institute of Standards and Technology dataset, which remixes the NIST data: half of the MNIST training set and half of its test set were taken from NIST's training dataset, while the other halves were taken from NIST's testing dataset. The digits were normalized, anti-aliased, and stored as 28x28 pixel images. The MNIST database contains 60,000 training images and 10,000 testing images. [Learn more about the history of MNIST here.]

American election mail envelope with face mask
Photo by Tiffany Tertipes / Unsplash

Yann LeCun’s Convolutional Neural Network architecture (also known as LeNet-5) was used by the US Postal Service to automatically identify handwritten zip code digits. The model was trained on the MNIST dataset. The process of recognizing characters in images is called Optical Character Recognition, or OCR.

Let's get into the tutorial!

Tutorials for training a model on the 0-9 digit images from the MNIST database are easy to find, and almost everyone works through one as a starter. In this tutorial, we will kick things up a notch and train a Neural Network to do OCR on captcha images that contain digits from 0-9 and lowercase letters from a-z. The dataset can be found on Kaggle. It contains CAPTCHA (Completely Automated Public Turing test to tell Computers and Humans Apart) images. Text CAPTCHAs like these have since been replaced by reCAPTCHA because they can be broken using Artificial Intelligence. The model we build will follow 3 steps:

  1. Encode the text label of each image as a sequence of integers.
  2. Train the Neural Network to map each image to its encoded label.
  3. Decode the network output into the predicted text by taking the most likely character at each step, one step at a time.
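
Steps 1 and 3 both rely on a fixed mapping between characters and integers. The sketch below shows that round trip in plain Python. The helper names `build_vocab`, `encode_label` and `decode_label` and the two sample labels are hypothetical and only serve as an illustration; the tutorial code itself uses a Keras lookup layer for the same purpose.

```python
def build_vocab(labels):
    """Build char -> index and index -> char tables from all characters in labels."""
    chars = sorted({ch for label in labels for ch in label})
    char_to_idx = {ch: i for i, ch in enumerate(chars)}
    idx_to_char = {i: ch for ch, i in char_to_idx.items()}
    return char_to_idx, idx_to_char


def encode_label(label, char_to_idx):
    """Turn a text label into a list of integer indices."""
    return [char_to_idx[ch] for ch in label]


def decode_label(indices, idx_to_char):
    """Turn a list of integer indices back into text."""
    return "".join(idx_to_char[i] for i in indices)


# Made-up labels, only for illustration
sample_labels = ["2b827", "x3deb"]
char_to_idx, idx_to_char = build_vocab(sample_labels)

encoded = encode_label("2b827", char_to_idx)
decoded = decode_label(encoded, idx_to_char)
print(encoded)
print(decoded)
```

Sorting the characters before numbering them keeps the mapping identical across runs, which matters if a trained model is saved and reloaded later.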

The Model:

The model is a CNN followed by bidirectional LSTM (BiLSTM) layers and a CTC layer, and its training is guided by the CTC (Connectionist Temporal Classification) loss function. We only feed the output matrix of the NN and the corresponding ground-truth (GT) text to the CTC loss function. But how does it know where each character occurs? Well, it does not know. Instead, it tries all possible alignments of the GT text in the image and takes the sum of all scores. This way, the score of a GT text is high if the sum over the alignment scores is high.
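
To make the "sum over all alignments" idea concrete, here is a brute-force sketch on a toy output matrix with two time steps and two symbols ("a" and the blank "-"). The function names and the probabilities are assumptions made up for illustration. Real CTC implementations use dynamic programming rather than enumerating every path, so this only works for tiny inputs.

```python
import math
from itertools import product

BLANK = "-"


def collapse(path):
    """Merge repeated symbols, then drop blanks (the CTC decoding rule)."""
    out = []
    prev = None
    for sym in path:
        if sym != prev and sym != BLANK:
            out.append(sym)
        prev = sym
    return "".join(out)


def ctc_score(probs, symbols, target):
    """Sum the probabilities of all paths that collapse to `target`."""
    total = 0.0
    for path in product(range(len(symbols)), repeat=len(probs)):
        if collapse([symbols[i] for i in path]) == target:
            p = 1.0
            for t, i in enumerate(path):
                p *= probs[t][i]
            total += p
    return total


symbols = ["a", BLANK]
# probs[t][i] = probability of symbols[i] at time step t (toy values)
probs = [[0.4, 0.6], [0.4, 0.6]]

score_a = ctc_score(probs, symbols, "a")       # paths "aa", "a-", "-a": about 0.64
score_empty = ctc_score(probs, symbols, "")    # path "--": about 0.36
loss_a = -math.log(score_a)                    # the loss is the negative log of the score
print(score_a, score_empty, loss_a)
```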

Example of annotated image

For encoding repeating characters, we introduce a pseudo-character called the blank, written as "-" in the following text. A clever coding scheme solves the duplicate-character problem: when encoding a text, we can insert arbitrarily many blanks at any position, and they are removed again when decoding it. However, we must insert a blank between duplicate characters, as in “hello” (e.g. encoded text: "---hhhhheeel-l-o"). [For the scope of this tutorial, we will not get into the exact procedure of CTC step-by-step.]
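
The decoding rule can be sketched in a few lines of Python: first merge runs of identical symbols, then remove the blanks. The function name `ctc_collapse` is a hypothetical helper for illustration, not part of the tutorial code.

```python
from itertools import groupby

BLANK = "-"


def ctc_collapse(encoded):
    """Merge runs of identical symbols, then remove blanks."""
    return "".join(sym for sym, _ in groupby(encoded) if sym != BLANK)


with_blank = ctc_collapse("---hhhhheeel-l-o")
without_blank = ctc_collapse("hheelllo")
print(with_blank)     # hello
print(without_blank)  # helo
```

The second call shows why the blank between the two "l" characters is required: without it, the repeated letters are merged into one.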

Step 1:

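Import the dataset and preprocess the images: convert them to grayscale, scale the pixel values to [0, 1], resize them to 200x50 and transpose them so that the image width becomes the time dimension. The 4x downscaling happens later, inside the convolutional blocks of the model.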

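import os
from pathlib import Path

import matplotlib.pyplot as plt
import numpy as np
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

# Path to the dataset directory
data_dir = Path("../input/captcha-version-2-images/samples/")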

images = sorted(list(map(str, list(data_dir.glob("*.png")))))
labels = [img.split(os.path.sep)[-1].split(".png")[0] for img in images]
characters = set(char for label in labels for char in label)

print("Number of images found: ", len(images))
print("Number of labels found: ", len(labels))
print("Number of unique characters: ", len(characters))
print("Characters present: ", characters)

batch_size = 16

img_width = 200
img_height = 50

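# Maximum length of any captcha label in the dataset (used when decoding)
max_length = max(len(label) for label in labels)

# Mapping characters to integers
# (newer TensorFlow releases expose this layer as `layers.StringLookup`)
char_to_num = layers.experimental.preprocessing.StringLookup(
    vocabulary=list(characters), num_oov_indices=0, mask_token=None
)

# Mapping integers back to original characters
num_to_char = layers.experimental.preprocessing.StringLookup(
    vocabulary=char_to_num.get_vocabulary(), mask_token=None, invert=True
)
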
def split_data(images, labels, train_size=0.9, shuffle=True):
    # 1. Get the total size of the dataset
    size = len(images)
    # 2. Make an indices array and shuffle it, if required
    indices = np.arange(size)
    if shuffle:
        np.random.shuffle(indices)
    # 3. Get the size of training samples
    train_samples = int(size * train_size)
    # 4. Split data into training and validation sets
    x_train, y_train = images[indices[:train_samples]], labels[indices[:train_samples]]
    x_valid, y_valid = images[indices[train_samples:]], labels[indices[train_samples:]]
    return x_train, x_valid, y_train, y_valid

# Splitting data into training and validation sets
x_train, x_valid, y_train, y_valid = split_data(np.array(images), np.array(labels))

def encode_single_sample(img_path, label):
    # 1. Read image
    img = tf.io.read_file(img_path)
    # 2. Decode and convert to grayscale
    img = tf.io.decode_png(img, channels=1)
    # 3. Convert to float32 in [0, 1] range
    img = tf.image.convert_image_dtype(img, tf.float32)
    # 4. Resize to the desired size
    img = tf.image.resize(img, [img_height, img_width])
    # 5. Transpose the image because we want the time
    # dimension to correspond to the width of the image.
    img = tf.transpose(img, perm=[1, 0, 2])
    # 6. Map the characters in label to numbers
    label = char_to_num(tf.strings.unicode_split(label, input_encoding="UTF-8"))
    # 7. Return a dict as our model is expecting two inputs
    return {"image": img, "label": label}
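
# Build the input pipelines: encode each sample, batch, and prefetch
train_dataset = tf.data.Dataset.from_tensor_slices((x_train, y_train))
train_dataset = (
    train_dataset.map(
        encode_single_sample, num_parallel_calls=tf.data.experimental.AUTOTUNE
    )
    .batch(batch_size)
    .prefetch(buffer_size=tf.data.experimental.AUTOTUNE)
)

validation_dataset = tf.data.Dataset.from_tensor_slices((x_valid, y_valid))
validation_dataset = (
    validation_dataset.map(
        encode_single_sample, num_parallel_calls=tf.data.experimental.AUTOTUNE
    )
    .batch(batch_size)
    .prefetch(buffer_size=tf.data.experimental.AUTOTUNE)
)
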
Step 2: Define the CTC Layer

class CTCLayer(layers.Layer):
    def __init__(self, name=None):
        super().__init__(name=name)
        self.loss_fn = keras.backend.ctc_batch_cost

    def call(self, y_true, y_pred):
        # Compute the training-time loss value and add it
        # to the layer using `self.add_loss()`.
        batch_len = tf.cast(tf.shape(y_true)[0], dtype="int64")
        input_length = tf.cast(tf.shape(y_pred)[1], dtype="int64")
        label_length = tf.cast(tf.shape(y_true)[1], dtype="int64")

        input_length = input_length * tf.ones(shape=(batch_len, 1), dtype="int64")
        label_length = label_length * tf.ones(shape=(batch_len, 1), dtype="int64")

        loss = self.loss_fn(y_true, y_pred, input_length, label_length)
        self.add_loss(loss)

        # At test time, just return the computed predictions
        return y_pred

Step 3: Build the Model

input_img = layers.Input(
    shape=(img_width, img_height, 1), name="image", dtype="float32"
)
labels = layers.Input(name="label", shape=(None,), dtype="float32")

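# First conv block
# NOTE: the filter counts, activation and padding of both Conv2D calls were
# lost from this listing, so the values below are assumptions. padding="same"
# and 64 filters in the second block match the reshape further down.
x = layers.Conv2D(
    32,
    (3, 3),
    activation="relu",
    padding="same",
)(input_img)
x = layers.MaxPooling2D((2, 2), name="pool1")(x)

# Second conv block
x = layers.Conv2D(
    64,
    (3, 3),
    activation="relu",
    padding="same",
)(x)
x = layers.MaxPooling2D((2, 2), name="pool2")(x)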

# We have used two max pool with pool size and strides 2.
# Hence, downsampled feature maps are 4x smaller. The number of
# filters in the last layer is 64. Reshape accordingly before
# passing the output to the RNN part of the model
new_shape = ((img_width // 4), (img_height // 4) * 64)
x = layers.Reshape(target_shape=new_shape, name="reshape")(x)
x = layers.Dense(64, activation="relu", name="dense1")(x)
x = layers.Dropout(0.2)(x)

# RNNs
x = layers.Bidirectional(layers.LSTM(128, return_sequences=True, dropout=0.25))(x)
x = layers.Bidirectional(layers.LSTM(64, return_sequences=True, dropout=0.25))(x)

# Output layer
x = layers.Dense(len(characters) + 1, activation="softmax", name="dense2")(x)

# Add CTC layer for calculating CTC loss at each step
output = CTCLayer(name="ctc_loss")(labels, x)

# Define the model
model = keras.models.Model(
    inputs=[input_img, labels], outputs=output, name="ocr_model_v1"
)
# Optimizer
opt = keras.optimizers.Adam()
# Compile the model. The CTC layer adds the loss itself, so no loss is passed here.
model.compile(optimizer=opt)
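
The reshape in the model depends on a bit of shape arithmetic, which can be checked by hand. The sketch below assumes that the convolutions keep the spatial size unchanged (as with "same" padding) and that each max-pooling layer halves width and height with integer division. The function name `feature_shape` is hypothetical.

```python
def feature_shape(img_width, img_height, num_pools, last_filters, pool=2):
    """Return the (time_steps, features) shape fed to the recurrent layers."""
    w, h = img_width, img_height
    for _ in range(num_pools):
        # Each pooling layer divides both spatial dimensions by the pool size
        w, h = w // pool, h // pool
    # The height and channel axes are flattened into one feature axis
    return (w, h * last_filters)


shape = feature_shape(200, 50, num_pools=2, last_filters=64)
print(shape)  # (50, 768)
```

With a 200x50 input this gives 50 time steps of 768 features each, so the network emits 50 character predictions per image, which CTC then aligns with the much shorter label.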

Step 4: Train the Model

epochs = 100
early_stopping_patience = 10

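early_stopping = keras.callbacks.EarlyStopping(
    patience=early_stopping_patience, restore_best_weights=True
)

history = model.fit(
    train_dataset,
    validation_data=validation_dataset,
    epochs=epochs,
    callbacks=[early_stopping],
)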
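
As a rough illustration of what the patience setting does, the following plain-Python sketch simulates early stopping on a list of validation losses. It is a simplified model of the idea, not the Keras implementation. The function name `early_stopping_epoch` and the loss values are made up.

```python
def early_stopping_epoch(val_losses, patience):
    """Return (stop_epoch, best_epoch), both 0-indexed.

    Training stops once the validation loss has failed to improve
    for `patience` consecutive epochs.
    """
    best = float("inf")
    best_epoch = -1
    wait = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch, best_epoch
    # Patience never ran out: training ran for all epochs
    return len(val_losses) - 1, best_epoch


stop_epoch, best_epoch = early_stopping_epoch([1.0, 0.8, 0.9, 0.85, 0.95], patience=3)
print(stop_epoch, best_epoch)  # 4 1
```

With `restore_best_weights=True`, the weights from the best epoch (epoch 1 in this toy run) are the ones kept after training stops.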

Step 5: Test the Model

prediction_model = keras.models.Model(
    model.get_layer(name="image").input, model.get_layer(name="dense2").output
)

# A utility function to decode the output of the network
def decode_batch_predictions(pred):
    input_len = np.ones(pred.shape[0]) * pred.shape[1]
    # Use greedy search. For complex tasks, you can use beam search
    results = keras.backend.ctc_decode(pred, input_length=input_len, greedy=True)[0][0][
        :, :max_length
    ]
    # Iterate over the results and get back the text
    output_text = []
    for res in results:
        res = tf.strings.reduce_join(num_to_char(res)).numpy().decode("utf-8")
        output_text.append(res)
    return output_text

acc_score = 0

#  Let's check results on some validation samples
for batch in validation_dataset.take(1):
    batch_images = batch["image"]
    batch_labels = batch["label"]

    preds = prediction_model.predict(batch_images)
    pred_texts = decode_batch_predictions(preds)
    m = len(pred_texts)
    orig_texts = []
    for label in batch_labels:
        label = tf.strings.reduce_join(num_to_char(label)).numpy().decode("utf-8")
        orig_texts.append(label)

    _, ax = plt.subplots(4, 4, figsize=(15, 5))
    for i in range(len(pred_texts)):
        img = (batch_images[i, :, :, 0] * 255).numpy().astype(np.uint8)
        img = img.T
        title = f"Prediction: {pred_texts[i]}"
        # Count predictions that match the ground-truth text exactly
        if str(pred_texts[i]) == orig_texts[i]:
            acc_score += 1
        ax[i // 4, i % 4].imshow(img, cmap="gray")
        ax[i // 4, i % 4].set_title(title)
        ax[i // 4, i % 4].axis("off")
plt.show()
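
Greedy decoding itself is simple enough to sketch without any framework: pick the most likely class at every time step, merge repeats, and drop the blank. The toy probability matrix, the two-letter alphabet and the choice of the last index as the blank are assumptions for illustration. The function name `greedy_ctc_decode` is hypothetical.

```python
from itertools import groupby


def greedy_ctc_decode(probs, idx_to_char, blank_index):
    """Best-path decoding: argmax per time step, merge repeats, drop blanks."""
    best_path = [max(range(len(step)), key=step.__getitem__) for step in probs]
    merged = [idx for idx, _ in groupby(best_path)]
    return "".join(idx_to_char[i] for i in merged if i != blank_index)


idx_to_char = {0: "a", 1: "b"}
BLANK_INDEX = 2  # assumed: the blank is the last class

# One row per time step, one column per class (toy values)
probs = [
    [0.7, 0.2, 0.1],  # most likely: a
    [0.6, 0.3, 0.1],  # most likely: a (merged with the previous step)
    [0.1, 0.1, 0.8],  # most likely: blank
    [0.8, 0.1, 0.1],  # most likely: a (kept, because a blank separates it)
    [0.2, 0.7, 0.1],  # most likely: b
]

text = greedy_ctc_decode(probs, idx_to_char, BLANK_INDEX)
print(text)  # aab
```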

Step 6: Validation Score

print("Validation Score: " + str(acc_score/m * 100))
Validation Score: 93.75

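This score is computed on a single batch of 16 validation images, and training involves randomness, so the result varies between runs: sometimes the model reaches 100% accuracy and sometimes it does not.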
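
The score counts a prediction as correct only if the whole string matches. A minimal sketch of that calculation, with the hypothetical helper `exact_match_accuracy` and made-up strings, shows how 15 correct predictions out of 16 give 93.75:

```python
def exact_match_accuracy(predicted, expected):
    """Percentage of predictions that match the expected text exactly."""
    if len(predicted) != len(expected):
        raise ValueError("predicted and expected must have the same length")
    if not expected:
        return 0.0
    correct = sum(p == e for p, e in zip(predicted, expected))
    return 100.0 * correct / len(expected)


# Made-up batch of 16 labels with one wrong prediction
expected = [f"cap{i:02d}" for i in range(16)]
predicted = list(expected)
predicted[3] = "cop03"

score = exact_match_accuracy(predicted, expected)
print(score)  # 93.75
```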

Find the public notebook below:

OCR for Captchas


For this tutorial, we have a very limited character set:

Characters present:  {'m', 'e', '3', '8', 'y', '6', 'x', 'b', 'f', 'w', 'd', '5', 'n', 'p', 'c', '2', 'g', '4', '7'}

You can choose to try OCR for captchas containing all the characters. For the scope of this blog, we used a limited character set to keep model training fast and the results easy to reproduce!


You can learn the intuition and idea behind CTC, or Connectionist Temporal Classification, from the article below:

An Intuitive Explanation of Connectionist Temporal Classification