Build a QA System using BERT and Hugging Face
A chatbot is AI software that can simulate a conversation (or chat) with a user through messaging applications, websites, mobile apps, or over the telephone.
A chatbot is often described as one of the most advanced and promising expressions of interaction between humans and machines. From a technological point of view, however, a chatbot is simply the natural evolution of a Question Answering system built on Natural Language Processing (NLP). Formulating responses to questions in natural language is one of the most typical examples of NLP applied in enterprises' end-use applications.
BERT (Bidirectional Encoder Representations from Transformers) has started a revolution in NLP, achieving state-of-the-art results on a variety of tasks, including Question Answering and the GLUE benchmark. Some have even called this the "ImageNet moment" of NLP.
Question Answering systems of this kind are built on pairs of questions and contexts: given a question and a passage of text, the model predicts the span of the passage that answers the question.
You can check out an example hosted version here.
The Tutorial:
In this tutorial, we will use a pre-trained BERT model from Hugging Face that has been fine-tuned on the SQuAD 2.0 dataset. We will provide the questions, and for context we will use the first matching article from Wikipedia, retrieved with the wikipedia package in Python. We will then tokenize the question and article with AutoTokenizer so that AutoModelForQuestionAnswering can predict the span of words that serves as our answer.
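To make the retrieval step concrete, here is a minimal sketch of how the wikipedia package fetches the context article (the query string is just one of the questions we will ask later):

import wikipedia as wiki

# Take the top search hit for a query and use its full text as the QA context
query = "Where is Microsoft Headquarters located?"
results = wiki.search(query)       # list of matching article titles
page = wiki.page(results[0])       # fetch the best-matching article
print(page.title)
print(page.content[:300])          # the article text we will later feed to BERT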
A little background:
The model we are using was originally pre-trained with masked language modeling: the researchers masked out key words in a huge corpus, and the model's task was to predict those words. Question answering builds on that pre-training by fine-tuning the model to predict the start and end positions of the answer span within a given context.
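To make the masked-language-modeling idea concrete, here is a small sketch using the fill-mask pipeline from the transformers library (the example sentence is purely illustrative):

from transformers import pipeline

# BERT's pre-training objective: predict the token hidden behind [MASK]
unmasker = pipeline("fill-mask", model="bert-base-uncased")
for prediction in unmasker("The capital of France is [MASK]."):
    print(prediction["token_str"], round(prediction["score"], 3))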
Now, let's get into the tutorial.
First, we will import the libraries we need and create a class that wraps loading the model, tokenizing the question and the matched Wikipedia article, and extracting the answer.
import torch
from collections import OrderedDict
from transformers import AutoTokenizer, AutoModelForQuestionAnswering

class QASystemWithBERT:
    def __init__(self, pretrained_model_name_or_path='bert-large-uncased'):
        self.READER_PATH = pretrained_model_name_or_path
        self.tokenizer = AutoTokenizer.from_pretrained(self.READER_PATH)
        self.model = AutoModelForQuestionAnswering.from_pretrained(self.READER_PATH)
        # Maximum number of tokens the model can handle (512 for BERT)
        self.max_len = self.model.config.max_position_embeddings
        self.chunked = False

    def tokenize(self, question, text):
        # Reset the flag for every new question/context pair
        self.chunked = False
        # Encode question and context together: [CLS] question [SEP] context [SEP]
        self.inputs = self.tokenizer.encode_plus(question, text,
                                                 add_special_tokens=True,
                                                 return_tensors="pt",
                                                 return_token_type_ids=True)
        self.input_ids = self.inputs["input_ids"].tolist()[0]
        # If the encoded article is too long for the model, split it into chunks
        if len(self.input_ids) > self.max_len:
            self.inputs = self.chunkify()
            self.chunked = True

    def chunkify(self):
        """
        Break up a long article into chunks that fit within the max token
        limit of the Transformer model, repeating the question in each chunk.
        """
        # token_type_ids are 0 for the question segment ([CLS] question [SEP])
        # and 1 for the context, so this mask selects the question part
        qmask = self.inputs['token_type_ids'].lt(1)
        qt = torch.masked_select(self.inputs['input_ids'], qmask)
        # Leave room in every chunk for the question plus a trailing [SEP]
        chunk_size = self.max_len - qt.size()[0] - 1

        chunked_input = OrderedDict()
        for k, v in self.inputs.items():
            q = torch.masked_select(v, qmask)   # question part of this tensor
            c = torch.masked_select(v, ~qmask)  # context part of this tensor
            chunks = torch.split(c, chunk_size)
            for i, chunk in enumerate(chunks):
                if i not in chunked_input:
                    chunked_input[i] = {}
                thing = torch.cat((q, chunk))
                # Every chunk except the last needs its own closing [SEP]
                if i != len(chunks) - 1:
                    if k == 'input_ids':
                        thing = torch.cat((thing, torch.tensor([102])))  # 102 = BERT's [SEP] id
                    else:
                        thing = torch.cat((thing, torch.tensor([1])))
                chunked_input[i][k] = torch.unsqueeze(thing, dim=0)
        return chunked_input

    def get_answer(self):
        if self.chunked:
            answer = ''
            for k, chunk in self.inputs.items():
                answer_start_scores, answer_end_scores = self.model(**chunk)[:2]
                answer_start = torch.argmax(answer_start_scores)
                answer_end = torch.argmax(answer_end_scores) + 1
                ans = self.convert_ids_to_string(
                    chunk['input_ids'][0][answer_start:answer_end])
                # '[CLS]' means the model found no answer in this chunk
                if ans != '[CLS]':
                    answer += ans + " / "
            return answer
        else:
            answer_start_scores, answer_end_scores = self.model(**self.inputs)[:2]
            answer_start = torch.argmax(answer_start_scores)
            answer_end = torch.argmax(answer_end_scores) + 1
            return self.convert_ids_to_string(
                self.inputs['input_ids'][0][answer_start:answer_end])

    def convert_ids_to_string(self, input_ids):
        return self.tokenizer.convert_tokens_to_string(
            self.tokenizer.convert_ids_to_tokens(input_ids))
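Before wiring the class up to Wikipedia, you can sanity-check it on a short hand-written context. The snippet below is a minimal sketch; the context sentence is made up purely for illustration:

qas = QASystemWithBERT("deepset/bert-base-cased-squad2")

# A one-sentence context, just to verify that tokenize() and get_answer() work
context = "The Eiffel Tower is located in Paris and was completed in 1889."
qas.tokenize("Where is the Eiffel Tower located?", context)
print(qas.get_answer())  # should print something like "Paris"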
Now, we will pass in a list of questions and retrieve their answers:
import wikipedia as wiki

questions = [
    'Where is Microsoft Headquarters located?',
    'Who is the President of the United States of America?',
    'How many sides does a hexagon have?'
]

qas = QASystemWithBERT("deepset/bert-base-cased-squad2")

for question in questions:
    print(f"Question: {question}")
    # Use the top Wikipedia search result as the context for this question
    results = wiki.search(question)
    page = wiki.page(results[0])
    print(f"Top wiki result: {page}")
    text = page.content
    qas.tokenize(question, text)
    print(f"Answer: {qas.get_answer()}")
    print()
Well, it answered all three questions correctly (after 10 retries on each question and rewording the questions numerous times). Have a look:

[Optional]
Check out the public notebook below:

[Bonus]
Check out the post below to better understand BERT and the impact it has had on NLP research.

Cheers!