Salesforce: Speaker Series: COVID-19 Information Retrieval with Deep-Learning based Semantic Search, Question Answering, and Abstractive Summarization

Date: 

Wednesday, November 18, 2020, 1:00pm to 2:30pm

Location: 

Virtual

"Join Andre Esteva in this week's speaker series as he dives into COVID-19 Information Retrieval with Deep-Learning based Semantic Search, Question Answering, and Abstractive Summarization.
For more information and to register, please visit this webpage.
The COVID-19 global pandemic has resulted in international efforts to understand, track, and mitigate the disease, yielding a significant corpus of COVID-19 and SARS-CoV-2-related publications across scientific disciplines. As of August 2020, 200,000 coronavirus-related publications have been collected through the COVID-19 Open Research Dataset Challenge.
Here we'll discuss CO-Search, a semantic search engine designed to handle complex queries over the COVID-19 literature, potentially aiding overburdened health workers in finding scientific answers and avoiding misinformation during a time of crisis. 
CO-Search is built from two sequential parts: a hybrid semantic-keyword retriever, which takes an input query and returns the 1000 most relevant documents, and a ranker, which sorts them.
The retriever is composed of a Siamese-BERT model that encodes query-level meaning, along with two keyword-based models (BM25, TF-IDF) that emphasize the most critical words of a query. The ranker assigns a relevance score to each document, computed from the outputs of (1) a question-answering module which gauges how much each document answers the query, and (2) an abstractive summarization module which determines how well a query matches a generated summary of the document.
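The retrieve-then-rank design described above can be illustrated with a minimal Python sketch. It is not the CO-Search implementation: the semantic scores are passed in as plain numbers standing in for Siamese-BERT similarities, only BM25 is used as the keyword signal, and the two combination rules (reciprocal rank fusion in the retriever, a weighted sum in the ranker) are assumptions chosen for illustration, since the abstract does not say how the signals are combined. All function names and parameters are hypothetical.

```python
import math
from collections import Counter


def tokenize(text):
    # Naive whitespace tokenizer; a real system would use something stronger.
    return text.lower().split()


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score every document against the query with Okapi BM25."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(t) for t in tokenized) / n_docs
    # Document frequency: number of documents that contain each term.
    df = Counter(term for toks in tokenized for term in set(toks))
    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        score = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


def reciprocal_rank_fusion(score_lists, k=60):
    """Fuse several per-document score lists by summing 1 / (k + rank)."""
    n = len(score_lists[0])
    fused = [0.0] * n
    for scores in score_lists:
        order = sorted(range(n), key=lambda i: scores[i], reverse=True)
        for rank, doc_id in enumerate(order, start=1):
            fused[doc_id] += 1.0 / (k + rank)
    return fused


def retrieve(query, docs, semantic_scores, top_k=1000):
    """Return indices of the top_k documents under the fused score."""
    fused = reciprocal_rank_fusion([bm25_scores(query, docs), semantic_scores])
    order = sorted(range(len(docs)), key=lambda i: fused[i], reverse=True)
    return order[:top_k]


def rerank(candidates, qa_scores, summary_scores, weight=0.5):
    """Sort candidates by a weighted sum of two scores (assumed combination)."""
    def relevance(i):
        return weight * qa_scores[i] + (1 - weight) * summary_scores[i]
    return sorted(candidates, key=relevance, reverse=True)


docs = [
    "masks reduce transmission of respiratory viruses",
    "coronavirus spike protein binds the ace2 receptor",
    "economic effects of lockdown policies",
]
semantic = [0.2, 0.9, 0.1]  # placeholder for embedding similarities
candidates = retrieve("spike protein receptor", docs, semantic, top_k=2)
ranked = rerank(candidates, qa_scores=[0.8, 0.3, 0.0], summary_scores=[0.6, 0.5, 0.0])
```

Rank fusion is used here because BM25 scores and embedding similarities live on different scales, so combining ranks avoids having to normalize them; other combination schemes would fit the same structure.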
To account for the relatively limited dataset (standard search engines are typically trained on billions of web documents) we develop a text augmentation technique which splits documents into pairs of paragraphs and their contained citations, creating millions of [citation title, paragraph] tuples for training the retriever. We evaluate our system (available at einstein.ai/covid) on the data of the TREC-COVID information retrieval challenge, obtaining strong performance across key search engine metrics."
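The citation-based augmentation described above can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes citations appear as numeric markers such as [3] and that a bibliography mapping marker numbers to titles is available, neither of which is stated in the abstract. The function name and its arguments are hypothetical.

```python
import re


def citation_paragraph_pairs(paragraphs, bibliography):
    """Build (citation title, paragraph) tuples from one document.

    paragraphs: list of paragraph strings with inline markers like [3].
    bibliography: dict mapping marker number to the cited paper's title.
    """
    pairs = []
    for paragraph in paragraphs:
        seen = set()
        for match in re.finditer(r"\[(\d+)\]", paragraph):
            ref_id = int(match.group(1))
            title = bibliography.get(ref_id)
            # Skip unknown markers and repeated citations within a paragraph.
            if title is None or ref_id in seen:
                continue
            seen.add(ref_id)
            pairs.append((title, paragraph))
    return pairs


paragraphs = [
    "ACE2 is the entry receptor [1]. The structure was resolved early [2].",
    "This paragraph cites nothing.",
    "The receptor finding [1] was later replicated [1].",
]
bibliography = {1: "Example title A", 2: "Example title B"}
pairs = citation_paragraph_pairs(paragraphs, bibliography)
```

Each resulting tuple pairs a short, query-like text (the cited title) with a longer passage that is presumably relevant to it, which is what makes such pairs usable as training data for a retriever.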