HADDELA KANKANAMALAGE, Prasanna Sumathipala (2023). An analysis of search query evolution in document classification and clustering. Doctoral, Sheffield Hallam University. [Thesis]
Documents
Haddela_2023_PhD_AnAnalysisSearch.pdf - Accepted Version
Available under License Creative Commons Attribution Non-commercial No Derivatives.
Abstract
With data analytics now widely used in decision-making, the analysis of document collections has become an established area of research. Document classification and clustering are two actively investigated topics within it, owing to the complexity of the problems and their impact on society.
However, many popular methods that classify and cluster documents with high accuracy give end users no explanation of their results, which undermines users' trust in some applications. Therefore, it is crucial to improve explainable classification and clustering methods.
One approach that has shown promise in this regard is the evolved search query (eSQ), a genetic algorithm (GA)-based approach for classification and clustering. GA-based methods excel at finding highly optimized solutions for complex problems, and eSQ has utilized this capability to develop classification and clustering methods that are also human interpretable.
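As a rough illustration of the general idea, and not the thesis's actual implementation, a genetic algorithm can evolve a small set of query words whose matches line up with a target category. The sketch below assumes a toy corpus, a simple OR-query (a document is retrieved if it contains any query word), and F1 as the fitness function; the names `DOCS`, `f1_score`, and `evolve_query` and all parameter values are hypothetical.

```python
import random

# Toy labelled corpus (hypothetical data, for illustration only).
DOCS = [
    ("the striker scored a late goal in the match", "sport"),
    ("the team won the league match", "sport"),
    ("the goal keeper saved a penalty", "sport"),
    ("the bank raised interest rates", "finance"),
    ("shares fell as the market closed", "finance"),
    ("the market expects a rates cut from the bank", "finance"),
]


def f1_score(query, docs, target):
    """F1 of an OR-query: a document is retrieved if it contains any query word."""
    tp = fp = fn = 0
    for text, label in docs:
        hit = bool(query & set(text.split()))
        if hit and label == target:
            tp += 1
        elif hit:
            fp += 1
        elif label == target:
            fn += 1
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def evolve_query(docs, target, query_size=2, pop_size=20, generations=30, seed=0):
    """Evolve a fixed-size set of query words that maximises F1 for `target`."""
    rng = random.Random(seed)
    vocab = sorted({w for text, _ in docs for w in text.split()})

    def fitness(q):
        return f1_score(q, docs, target)

    # Initial population: random word sets drawn from the corpus vocabulary.
    population = [frozenset(rng.sample(vocab, query_size)) for _ in range(pop_size)]
    for _ in range(generations):
        next_pop = [max(population, key=fitness)]  # elitism: keep the current best
        while len(next_pop) < pop_size:
            # Tournament selection: best of three random individuals, twice.
            p1 = max(rng.sample(population, 3), key=fitness)
            p2 = max(rng.sample(population, 3), key=fitness)
            # Crossover: the child draws its words from the union of both parents.
            child = set(rng.sample(sorted(p1 | p2), query_size))
            # Mutation: occasionally swap one word for a word not already present.
            if rng.random() < 0.2:
                new_word = rng.choice([w for w in vocab if w not in child])
                child.remove(rng.choice(sorted(child)))
                child.add(new_word)
            next_pop.append(frozenset(child))
        population = next_pop
    best = max(population, key=fitness)
    return best, fitness(best)


if __name__ == "__main__":
    query, score = evolve_query(DOCS, "sport")
    print(sorted(query), round(score, 3))
```

The result is a short list of words that a person can read directly, which is the sense in which a query-based classifier is interpretable. A real system would use a search engine's query syntax and a far larger corpus.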
The primary focus of this study is to analyse the eSQ approach to document classification and clustering with an emphasis on explainability. The investigation covers three perspectives of the eSQ-based methods: explainability, document classification, and document clustering. This thesis presents a taxonomy for classification based on human friendliness, empirical observations on the performance of eSQ classifiers using different feature selection methods, the effectiveness of eSQ classifiers for Sinhala documents, and the performance of eSQ clustering for Sinhala documents.
The research makes four main contributions: categorizing popular classification methods using the new taxonomy, integrating feature selection methods into eSQ classifiers, extending Apache Lucene with Sinhala language support and basic pre-processing tools, and improving eSQ hybrid single word clustering methods. Notably, the eSQ-based classification and clustering methods demonstrate superior performance when document categories overlap.