COVID-19 Literature Mining and Retrieval Using Text Mining Approaches

SANKU, Satya Uday, PAVANI, Satti Thanuja, TANGIRALA, Jaya Lakshmi and CHIVUKULA, Rohit (2024). COVID-19 Literature Mining and Retrieval Using Text Mining Approaches. SN Computer Science, 5: 211.

Covid_19_Research_Literature_Mining_AAV.pdf - Accepted Version
Restricted to Repository staff only until 17 January 2025.
All rights reserved.

In light of the recent COVID-19 epidemic, users face growing difficulty in navigating the vast amount of Internet content to locate relevant information. In this study, we develop an information extraction mechanism that answers users' inquiries about COVID-19 at varying levels of detail. To this end, we use the CORD-19 dataset made available by the Allen Institute for AI. The dataset comprises 200,000 scholarly articles on coronavirus research, sourced from reputable platforms such as PubMed's PMC, WHO, bioRxiv, and medRxiv pre-prints. In addition to this document corpus, a supplementary list of topics is provided, covering inquiries about the infection. Each topic has three levels of representation: query, question, and narrative. The query is the most basic form of inquiry, the question is an intermediate form, and the narrative is the most detailed and elaborate form. The proposed model uses several word embedding techniques to generate vectors for the documents in the corpus and for the query, question, narrative, and combinations of them: a frequency-based technique (Bag-of-words), a semantics-based technique (Word2Vec), a hybrid method that combines frequency with semantics (TF–IDF weighted Word2Vec), and a sequence- and semantics-based technique (BERT). Once the vectors have been created, cosine similarity is used to measure the similarity between document vectors and topic vectors. Compared with the frequency-based and semantics-based models, BERT retrieves more relevant documents, with 90% accuracy. The proposed hybrid model, TF–IDF weighted Word2Vec, achieves an accuracy of 87%. This is comparable to the average performance of the BERT-Base model and demonstrates computational efficiency.
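
The hybrid approach described in the abstract (TF–IDF weighted Word2Vec followed by cosine similarity) can be illustrated with a minimal sketch. This is not the authors' implementation: the three-dimensional word vectors in TOY_VECTORS are hypothetical stand-ins for trained Word2Vec embeddings, the smoothed IDF formula is one common choice among several, and the function names (idf_table, weighted_embedding, cosine, rank) are invented for illustration. The sketch only shows how term weights and word vectors might be combined into one vector per text, and how texts are then ranked against a topic.

```python
import math
from collections import Counter

# Hypothetical 3-dimensional vectors standing in for trained Word2Vec
# embeddings; the values are made up for illustration only.
TOY_VECTORS = {
    "coronavirus": [0.9, 0.1, 0.0],
    "transmission": [0.7, 0.3, 0.1],
    "vaccine": [0.1, 0.9, 0.2],
    "immunity": [0.2, 0.8, 0.1],
    "weather": [0.0, 0.1, 0.9],
    "humidity": [0.1, 0.0, 0.8],
}


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately simple tokenizer)."""
    return text.lower().split()


def idf_table(tokenized_docs):
    """Smoothed inverse document frequency for every term in the corpus."""
    n_docs = len(tokenized_docs)
    doc_freq = Counter(term for doc in tokenized_docs for term in set(doc))
    return {
        term: math.log((1 + n_docs) / (1 + count)) + 1.0
        for term, count in doc_freq.items()
    }


def weighted_embedding(tokens, idf, vectors):
    """Average the word vectors of a text, weighting each term by TF * IDF."""
    term_freq = Counter(t for t in tokens if t in vectors)
    dim = len(next(iter(vectors.values())))
    total = [0.0] * dim
    weight_sum = 0.0
    for term, count in term_freq.items():
        # Terms unseen in the corpus get a neutral weight of 1.0 (assumption).
        weight = count * idf.get(term, 1.0)
        for i, value in enumerate(vectors[term]):
            total[i] += weight * value
        weight_sum += weight
    if weight_sum == 0.0:
        return total  # no known terms: return the zero vector
    return [value / weight_sum for value in total]


def cosine(a, b):
    """Cosine similarity; defined as 0.0 when either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank(topic_text, docs, vectors=TOY_VECTORS):
    """Return (score, document index) pairs, most similar document first."""
    tokenized = [tokenize(doc) for doc in docs]
    idf = idf_table(tokenized)
    topic_vec = weighted_embedding(tokenize(topic_text), idf, vectors)
    scored = [
        (cosine(topic_vec, weighted_embedding(tokens, idf, vectors)), index)
        for index, tokens in enumerate(tokenized)
    ]
    return sorted(scored, reverse=True)


docs = ["coronavirus transmission", "vaccine immunity", "weather humidity"]
for score, index in rank("vaccine immunity", docs):
    print(f"{score:.3f}  {docs[index]}")
```

In a realistic setting the toy dictionary would be replaced by embeddings trained on the corpus, and the topic text could be the query, the question, the narrative, or a concatenation of them, as the abstract describes.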

Item Type: Article
Uncontrolled Keywords: 46 Information and computing sciences
SWORD Depositor: Symplectic Elements
Depositing User: Symplectic Elements
Date Deposited: 27 Feb 2024 15:38
Last Modified: 01 Mar 2024 08:00