An analysis of search query evolution in document classification and clustering

HADDELA KANKANAMALAGE, Prasanna Sumathipala (2023). An analysis of search query evolution in document classification and clustering. Doctoral, Sheffield Hallam University.

Haddela_2023_PhD_AnAnalysisSearch.pdf - Accepted Version
Creative Commons Attribution Non-commercial No Derivatives.

Link to published version: https://doi.org/10.7190/shu-thesis-00587

Abstract

With the increasing use of data analytics in decision-making, the analysis of document collections for various purposes has become a widely accepted area of research. Document classification and clustering are two intensely investigated and active areas of research because of the complex nature of the problem and its impact on society. However, many of the popular methods developed to classify and cluster documents with high accuracy offer no explanation to end users, which reduces the trust those users place in certain applications. It is therefore crucial to improve explainable classification and clustering methods. One approach that has shown promise in this regard is the evolved search query (eSQ), a genetic algorithm (GA)-based approach to classification and clustering. GA-based methods excel at finding highly optimized solutions to complex problems, and eSQ uses this capability to develop classification and clustering methods that are also human-interpretable. The primary focus of this study is to analyse the eSQ approach to document classification and clustering with an emphasis on explainability. The investigation covers three perspectives of the eSQ-based methods: explainability, document classification, and document clustering. This thesis presents a taxonomy of classification methods based on human friendliness, empirical observations on the performance of eSQ classifiers using different feature selection methods, the effectiveness of eSQ classifiers for Sinhala documents, and the performance of eSQ clustering for Sinhala documents. The research contributes significantly by categorizing popular classification methods using the new taxonomy, integrating feature selection methods into eSQ classifiers, extending Apache Lucene with Sinhala language support and basic pre-processing tools, and improving eSQ hybrid single word clustering methods. Notably, the eSQ-based classification and clustering methods demonstrate superior performance when document categories overlap.

Item Type: Thesis (Doctoral)
Contributors:
Thesis advisor - Hirsch, Laurence [0000-0002-3589-9816]
Additional Information: Director of studies: Dr. Laurence Hirsch. "No PQ harvesting"
Identification Number: https://doi.org/10.7190/shu-thesis-00587
Depositing User: Colin Knott
Date Deposited: 05 Mar 2024 16:58
Last Modified: 06 Mar 2024 02:00
URI: https://shura.shu.ac.uk/id/eprint/33355
