Try your first Neural Network for Neural Information Retrieval

Information Retrieval Feb 20, 2023

Neural Information Retrieval (NIR) is an emerging research field that combines traditional Information Retrieval (IR) techniques with neural network models to improve the effectiveness and efficiency of information retrieval systems. One approach to NIR is to use a retrieve-and-rerank strategy, where a traditional IR method is used to retrieve a set of candidate documents, and a neural network model is used to rerank these candidates based on their relevance to the query.
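In sketch form, the whole pipeline looks something like this (the retriever and reranker objects here are illustrative placeholders, not real APIs):

# Two-stage retrieve-and-rerank: a cheap lexical pass narrows the corpus,
# then a more expensive neural pass reorders the survivors
def search(query, corpus, retriever, reranker, num_candidates=100, top_k=10):
    candidates = retriever.top_n(query, corpus, n=num_candidates)       # fast, lexical
    scored = [(doc, reranker.score(query, doc)) for doc in candidates]  # slow, neural
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_k]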


In this tutorial, we'll build a simple NIR system that uses the BM25 algorithm for retrieval and a bi-encoder model for reranking. We'll use the rank_bm25 library for BM25 retrieval and the Sentence Transformers library for the bi-encoder model.

Step 1: Data Preparation

Download the MS MARCO document corpus and extract the TSV file to load into BM25:

! wget https://msmarco.blob.core.windows.net/msmarcoranking/msmarco-docs.tsv.gz

! gzip -d msmarco-docs.tsv.gz

# Load the MS MARCO document corpus
# Each line of msmarco-docs.tsv is: docid <tab> url <tab> title <tab> body
with open('/kaggle/working/msmarco-docs.tsv') as f:
    docs = [line.rstrip('\n').split('\t') for line in f]
doc_ids = [doc[0] for doc in docs]
doc_texts = [doc[3] for doc in docs]  # use the body field as the document text
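Note that the full corpus holds roughly 3.2 million documents, which is far more than a notebook can comfortably embed. If you just want to follow along, a streaming variant that caps the corpus size might look like this (MAX_DOCS is an arbitrary limit I've added, not part of the dataset):

# Memory-friendly loader that stops after MAX_DOCS documents
MAX_DOCS = 10_000

doc_ids, doc_texts = [], []
with open('/kaggle/working/msmarco-docs.tsv') as f:
    for i, line in enumerate(f):
        if i >= MAX_DOCS:
            break
        fields = line.rstrip('\n').split('\t')
        if len(fields) == 4:  # skip malformed lines
            doc_ids.append(fields[0])
            doc_texts.append(fields[3])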

Step 2: Instantiate the BM25 Function

from rank_bm25 import BM25Okapi

# BM25Okapi expects a pre-tokenized corpus (a list of token lists), not raw strings
tokenized_corpus = [doc.lower().split() for doc in doc_texts]
bm25 = BM25Okapi(tokenized_corpus)
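As a quick sanity check, you can score the corpus against a toy query (the query string is made up for illustration):

# Higher BM25 score = stronger lexical match
sample_query = "what is information retrieval".split()
scores = bm25.get_scores(sample_query)
best = max(range(len(scores)), key=scores.__getitem__)
print(doc_ids[best], scores[best])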

Step 3: Load the Bi-Encoder Model

We will use a Sentence Transformers bi-encoder model trained on the MS MARCO dataset. You can select your model of choice from their website.

from sentence_transformers import SentenceTransformer, util

# Load pre-trained Sentence Transformer bi-encoder model
model = SentenceTransformer('msmarco-distilbert-dot-v5')
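The 'dot' in the model name indicates it was trained for dot-product similarity, so we score with util.dot_score rather than cosine similarity. A quick check on a toy pair (both strings are made up):

query_emb = model.encode("What is the capital of the USA?", convert_to_tensor=True)
doc_embs = model.encode(
    ["Washington, D.C. is the capital of the United States.",
     "Paris is the capital of France."],
    convert_to_tensor=True,
)
print(util.dot_score(query_emb, doc_embs))  # the first document should score higher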

Step 4: Write the Main Retrieve-and-Rerank Function

# Define a function to retrieve and rerank documents for a given query
def retrieve_documents(query, top_k=10, num_candidates=100):
    # Stage 1: use BM25 to retrieve an initial candidate set
    tokenized_query = query.lower().split()
    bm25_scores = bm25.get_scores(tokenized_query)
    candidates = sorted(range(len(doc_texts)), key=lambda i: bm25_scores[i], reverse=True)[:num_candidates]

    # Stage 2: re-rank only the candidates with the bi-encoder
    query_embedding = model.encode(query, convert_to_tensor=True)
    doc_embeddings = model.encode([doc_texts[i] for i in candidates], convert_to_tensor=True)
    dot_similarities = util.dot_score(query_embedding, doc_embeddings)[0].cpu().tolist()

    # Combine the BM25 scores and dot-product similarities into a final score
    # (the 0.7/0.3 weighting is a simple heuristic)
    final_scores = {i: 0.7 * bm25_scores[i] + 0.3 * dot_similarities[rank]
                    for rank, i in enumerate(candidates)}

    # Sort the candidates by their final scores and return the top K documents
    top_k_indices = sorted(final_scores, key=final_scores.get, reverse=True)[:top_k]
    return [{'doc_id': doc_ids[i], 'doc_text': doc_texts[i], 'score': final_scores[i]}
            for i in top_k_indices]
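One caveat with the weighted combination: BM25 scores and dot-product similarities live on different scales, so fixed 0.7/0.3 weights can be dominated by whichever score happens to be larger. A minimal min-max normalization helper you could apply to both score lists before mixing (the function name is my own, not from any library):

def min_max_normalize(scores):
    # Rescale scores into [0, 1] so different scorers are comparable
    lo, hi = min(scores), max(scores)
    if hi == lo:  # avoid division by zero when all scores are equal
        return [0.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]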

Step 5: Run Inference

retrieve_documents("What is the capital of the USA?")
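For readability, you can print just the ID, score, and a snippet of each hit:

for hit in retrieve_documents("What is the capital of the USA?", top_k=5):
    print(f"{hit['doc_id']}  {hit['score']:.3f}  {hit['doc_text'][:80]}...")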

Conclusion

In this tutorial, we saw how to build a neural information retrieval system using the retrieve-and-rerank approach. We used the MS MARCO dataset to demonstrate how to use BM25 for retrieval and a bi-encoder model for reranking.

[Optional]

Check out the public notebook containing the full implementation:

IR Try

[Bonus]

If you are interested in diving deeper into Information Retrieval, you can check out Sentence Transformers' documentation.

P.S. - This blog was generated with the help of ChatGPT, which makes me wonder whether writing 'Train your first...' blogs is even worth doing anymore. Any suggestions?

P.P.S. - ChatGPT did slip up on a few steps and details along the way, since it can only generate a limited number of tokens per response.

Image Courtesy: https://tmramalho.github.io/science/2020/06/02/information-retrieval-with-deep-neural-models/
