Try your first Neural Network for Neural Information Retrieval
Neural Information Retrieval (NIR) is an emerging research field that combines traditional Information Retrieval (IR) techniques with neural network models to improve the effectiveness and efficiency of information retrieval systems. One approach to NIR is to use a retrieve-and-rerank
strategy, where a traditional IR method is used to retrieve a set of candidate documents, and a neural network model is used to rerank these candidates based on their relevance to the query.
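The retrieve-and-rerank idea can be sketched in a few lines of plain Python, assuming a cheap first-stage scorer and a more expensive second-stage scorer (both toy functions here, standing in for BM25 and a neural model):

```python
def retrieve_and_rerank(query, docs, cheap_score, expensive_score, k=3):
    """Score all docs cheaply, keep the top k candidates,
    then re-order only those candidates with the expensive scorer."""
    candidates = sorted(docs, key=lambda d: cheap_score(query, d), reverse=True)[:k]
    return sorted(candidates, key=lambda d: expensive_score(query, d), reverse=True)

# Toy scorer: raw term overlap. In a real system the cheap stage would be
# BM25 and the expensive stage a neural relevance model.
def overlap(query, doc):
    return len(set(query.split()) & set(doc.split()))
```

The point of the two stages is cost: the cheap scorer touches every document, while the expensive scorer only ever sees the k candidates.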
In this tutorial, we'll build a simple NIR system that uses the BM25
algorithm for retrieval and a bi-encoder model for reranking. We'll use the rank_bm25 library for BM25 retrieval and the Sentence Transformers library (built on Hugging Face Transformers) for the bi-encoder model.
Step 1: Data Preparation
Download the MS MARCO documents and extract the TSV file to load into BM25.
! wget https://msmarco.blob.core.windows.net/msmarcoranking/msmarco-docs.tsv.gz
! gzip -d msmarco-docs.tsv.gz
# Load the MS MARCO dataset
# Each line of msmarco-docs.tsv is: doc_id \t url \t title \t body
with open('/kaggle/working/msmarco-docs.tsv') as f:
    docs = [line.rstrip('\n').split('\t') for line in f]
doc_ids = [doc[0] for doc in docs]
doc_texts = [doc[3] for doc in docs]  # use the body text, not the URL
Step 2: Instantiate the BM25 Function
from rank_bm25 import BM25Okapi
# BM25Okapi expects a tokenized corpus: one list of tokens per document
tokenized_corpus = [doc.lower().split() for doc in doc_texts]
bm25 = BM25Okapi(tokenized_corpus)
Step 3: Load the Bi-Encoder Model
We will use a Sentence Transformers bi-encoder model trained on the MS MARCO dataset. You can select your model of choice from their website.
from sentence_transformers import SentenceTransformer, util
# Load pre-trained Sentence Transformer bi-encoder model
model = SentenceTransformer('msmarco-distilbert-dot-v5')
Step 4: Write the Main Retrieval Function
# Define a function to retrieve documents for a given query
def retrieve_documents(query, top_k=10):
    # Use BM25 to get the initial ranking of documents
    tokenized_query = query.lower().split()
    bm25_scores = bm25.get_scores(tokenized_query)
    # Use the Sentence Transformer bi-encoder to re-rank the documents
    query_embedding = model.encode(query, convert_to_tensor=True)
    doc_embeddings = model.encode(doc_texts, convert_to_tensor=True)
    # dot_score returns a (1, num_docs) tensor; take the first row
    dot_similarities = util.dot_score(query_embedding, doc_embeddings)[0]
    # Combine the BM25 scores and dot-product similarities into the final scores
    final_scores = [0.7 * bm25_scores[i] + 0.3 * float(dot_similarities[i]) for i in range(len(docs))]
    # Sort the documents by their final scores and return the top K documents
    top_k_indices = sorted(range(len(final_scores)), key=lambda i: final_scores[i], reverse=True)[:top_k]
    top_k_docs = [{'doc_id': doc_ids[i], 'doc_text': doc_texts[i], 'score': final_scores[i]} for i in top_k_indices]
    return top_k_docs
Step 5: Run Inference
retrieve_documents("What is the capital of the USA?")
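The function returns a list of dicts, so one possible way to display the results is a small helper like this (a convenience sketch, not part of the pipeline itself):

```python
def print_hits(hits):
    """Pretty-print the list of dicts returned by retrieve_documents."""
    for rank, hit in enumerate(hits, start=1):
        print(f"{rank}. [{hit['doc_id']}] score={hit['score']:.3f}  {hit['doc_text'][:80]}")

# e.g. print_hits(retrieve_documents("What is the capital of the USA?"))
```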
Conclusion
In this tutorial, we saw how to build a neural information retrieval system using the retrieve-and-rerank approach. We used the MS MARCO dataset to demonstrate how to use BM25 for retrieval and a bi-encoder model for reranking.
[Optional]
Check out the public notebook containing the full implementation:

[Bonus]
If you are interested in diving deeper into Information Retrieval, you can check out the Sentence Transformers page.
P.S. - This blog was generated with the help of ChatGPT, which makes me wonder whether writing 'Train your first...'
blogs is worth doing anymore. Any suggestions?
P.P.S. - ChatGPT did slip up on some steps and details along the way, since it can only generate a limited number of tokens per prediction.
Image Courtesy: https://tmramalho.github.io/science/2020/06/02/information-retrieval-with-deep-neural-models/