COVID-19 Literature Mining and Retrieval Using Text Mining Approaches

Sanku, Satya Uday; Pavani, Satti Thanuja; Tangirala, Jaya Lakshmi; Chivukula, Rohit

SANKU, Satya Uday, PAVANI, Satti Thanuja, TANGIRALA, Jaya Lakshmi and CHIVUKULA, Rohit (2024). COVID-19 Literature Mining and Retrieval Using Text Mining Approaches. SN Computer Science, 5: 211. [Article]

Documents

PDF
Covid_19_Research_Literature_Mining_AAV.pdf - Accepted Version
Available under License All rights reserved.

Abstract

In light of the recent COVID-19 epidemic, users face growing difficulty in navigating the vast expanse of Internet content to locate relevant information. In this study, we develop an information extraction mechanism that answers users’ inquiries about COVID-19 at a range of depths. To accomplish this, we use the CORD-19 dataset made available by the Allen Institute for AI. This dataset comprises 200,000 scholarly articles on coronavirus research, sourced from many reputable platforms, such as PubMed’s PMC, WHO, bioRxiv, and medRxiv pre-prints. In addition to the document corpus, a supplementary list of topics is provided, covering inquiries about the infection. Each topic has three levels of representation, namely query, question, and narrative: the query is the most basic form of inquiry, the question an intermediate form, and the narrative the most detailed and elaborate form. The proposed model uses various word embedding techniques, namely frequency based (Bag-of-words), semantic based (Word2Vec), a hybrid method that combines frequency with semantics (TF–IDF weighted Word2Vec), and sequence cum semantic based (BERT), to generate vectors for the documents in the corpus and for the query, question, narrative, and combinations of them. Once the vectors have been created, cosine similarity is used to measure the similarity between document vectors and topic vectors. Compared with the frequency and semantic models, BERT retrieves more relevant documents, with 90% accuracy. The proposed hybrid model, the TF–IDF weighted Word2Vec, achieves an accuracy of 87%. This is comparable to the average performance of the BERT-Base model, demonstrating the hybrid model’s computational efficiency.
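
The hybrid retrieval step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. The two-dimensional vectors in TOY_VECTORS are made-up stand-ins for trained Word2Vec embeddings, the smoothed IDF formula and the default weight for unseen terms are assumptions, and all function names are hypothetical. The sketch builds a TF-IDF weighted average of word vectors for each text and ranks documents by cosine similarity to the topic vector.

```python
import math
from collections import Counter

# Toy 2-d word vectors standing in for trained Word2Vec embeddings
# (hypothetical values, chosen only for illustration).
TOY_VECTORS = {
    "virus": [1.0, 0.0],
    "transmission": [0.9, 0.1],
    "airborne": [0.8, 0.2],
    "vaccine": [0.0, 1.0],
    "trial": [0.1, 0.9],
    "dose": [0.2, 0.8],
}


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately simple tokenizer)."""
    return text.lower().split()


def idf_table(docs_tokens):
    """Smoothed inverse document frequency per token (assumed formula)."""
    n_docs = len(docs_tokens)
    doc_freq = Counter(tok for toks in docs_tokens for tok in set(toks))
    return {tok: math.log((1 + n_docs) / (1 + df)) + 1.0
            for tok, df in doc_freq.items()}


def weighted_vector(tokens, idf, vectors):
    """TF-IDF weighted average of the word vectors of the known tokens."""
    dim = len(next(iter(vectors.values())))
    total = [0.0] * dim
    weight_sum = 0.0
    for tok, count in Counter(tokens).items():
        if tok not in vectors:
            continue  # skip out-of-vocabulary words
        # Tokens absent from the corpus get a default IDF of 1.0 (assumption).
        weight = count * idf.get(tok, 1.0)
        for i, value in enumerate(vectors[tok]):
            total[i] += weight * value
        weight_sum += weight
    if weight_sum == 0.0:
        return total
    return [x / weight_sum for x in total]


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank(topic, docs, vectors=TOY_VECTORS):
    """Return document indices ordered from most to least similar to topic."""
    docs_tokens = [tokenize(d) for d in docs]
    idf = idf_table(docs_tokens)
    topic_vec = weighted_vector(tokenize(topic), idf, vectors)
    scored = [(cosine(topic_vec, weighted_vector(toks, idf, vectors)), i)
              for i, toks in enumerate(docs_tokens)]
    return [i for _, i in sorted(scored, key=lambda s: s[0], reverse=True)]
```

In this sketch a topic's query, question, or narrative text (or a concatenation of them) would be passed as the topic argument, and the returned indices give the retrieval order over the corpus.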

More Information

Official URL:

https://link.springer.com/article/10.1007/s42979-0...

Uncontrolled Keywords:

46 Information and computing sciences

Identifiers

Identification Number:

10.1007/s42979-023-02550-1

ORCID for Jaya Lakshmi Tangirala: