Building a Natural Language Question and Answer Search Engine

Abstract: 

Using articles published on the NASDAQ website as a data source, attendees will build a search engine that gives direct answers to naturally phrased questions. For example, the query 'What market does FitBit compete in?' should return an answer along the lines of 'wearables'.

The main tools for this build are Python and the open-source libraries Whoosh and DeepPavlov. Whoosh will be used to build a traditional text-based search engine, and a pre-trained deep learning model from DeepPavlov will perform natural language Q&A on the documents that Whoosh returns.
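The two-stage design described above (retrieve candidate documents, then extract an answer from them) can be sketched in plain Python. This is only an illustration under stated assumptions: `TinyIndex` is a hypothetical toy inverted index standing in for Whoosh, `reader` is a hypothetical callable standing in for the DeepPavlov model, and the sample sentences in the usage note are made up rather than taken from the NASDAQ data.

```python
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


class TinyIndex:
    """Toy inverted index (hypothetical stand-in for the Whoosh stage)."""

    def __init__(self):
        self.docs = {}
        self.postings = defaultdict(Counter)  # term -> {doc_id: term count}

    def add(self, doc_id, text):
        self.docs[doc_id] = text
        for term in tokenize(text):
            self.postings[term][doc_id] += 1

    def search(self, query, limit=3):
        """Rank documents by how often they contain the query terms."""
        scores = Counter()
        for term in set(tokenize(query)):
            for doc_id, count in self.postings.get(term, {}).items():
                scores[doc_id] += count
        ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
        return [doc_id for doc_id, _ in ranked[:limit]]


def answer(question, index, reader, limit=3):
    """Stage 1: retrieve candidate documents. Stage 2: hand them to the reader."""
    contexts = [index.docs[doc_id] for doc_id in index.search(question, limit)]
    return reader(question, contexts)
```

A usage sketch with made-up documents and a placeholder reader that simply returns the top-ranked text:

```python
index = TinyIndex()
index.add("a", "Fitbit competes in the wearables market.")
index.add("b", "Oil prices rose on Monday.")
print(answer("What market does FitBit compete in?", index, lambda q, ctxs: ctxs[0]))
```

Splitting the work this way keeps the expensive reading step confined to a handful of retrieved documents instead of the whole corpus.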

While covering Whoosh, there will be a general discussion of information retrieval and deeper dives into the NLP tasks performed by Whoosh. These techniques include tokenization, stopword removal, n-grams, stemming, lemmatization, and Named Entity Recognition.
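A few of the techniques listed above (tokenization, stopword removal, stemming, and n-grams) can be illustrated with a minimal standard-library sketch. It is not how Whoosh implements them: the stopword list is a tiny hand-picked sample, and `naive_stem` is a deliberately crude, hypothetical suffix stripper rather than a real stemming algorithm.

```python
import re

# Tiny illustrative stopword list; real analyzers ship a much longer one.
STOPWORDS = {"a", "an", "and", "in", "is", "of", "the", "to", "what", "does"}


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def remove_stopwords(tokens):
    """Drop very common words that carry little search signal."""
    return [t for t in tokens if t not in STOPWORDS]


def naive_stem(token):
    """Strip a few common English suffixes, keeping at least three characters."""
    for suffix in ("ing", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def ngrams(tokens, n=2):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def analyze(text):
    """Chain the steps: tokenize, remove stopwords, then stem."""
    return [naive_stem(t) for t in remove_stopwords(tokenize(text))]
```

Applying the same chain to both documents and queries is what lets a search for "markets" match a document that only says "market".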

While covering DeepPavlov, there will be a discussion of the pre-trained model being used. The model of interest is a deep learning model based on Bidirectional Encoder Representations from Transformers (BERT) and trained on the Stanford Question Answering Dataset (SQuAD). The discussion will focus on the practical application of the model and will not cover how to recreate it from scratch.

Each step of this search engine build process will have coding exercises to help attendees internalize the information presented. All data and Python code used for the training session will be open-sourced and available on GitHub to be used as a reference for both live and remote attendees.

Bio: 

Adam Spannbauer is a machine learning engineer at Eastman Chemical Company in East Tennessee. His work has focused on NLP projects using open source tools such as Python, R, and Shiny. Adam has instructor experience through his DataCamp course, "Software Engineering for Data Scientists in Python." Outside of work, Adam stays active in the open source community on GitHub, mostly working on side projects involving computer vision. He holds degrees from Maryville College and the University of Tennessee.