HIRSCH, Laurence and DI NUOVO, Alessandro (2017). Document clustering with evolved search queries. In: 2017 IEEE Congress on Evolutionary Computation (CEC), Proceedings : 5-8 June 2017, Donostia - San Sebastián, Spain. Piscataway, NJ, IEEE, 1239-1246. [Book Section]
Documents
PDF
Hirsch_DiNuovo_TextClusteringGA.pdf - Accepted Version
Available under License All rights reserved.
Abstract
Search queries define a set of documents located in a collection and can be used to rank the documents by assigning each document a score according to its closeness to the query in the multidimensional space of weighted terms. In this paper, we describe a system in which an island model genetic algorithm (GA) creates individuals, each of which can generate a set of Apache Lucene search queries for the purpose of text document clustering. A cluster is specified by the documents returned by a single query in the set. Each document that is included in only one of the clusters adds to the fitness of the individual, and each document that is included in more than one cluster reduces the fitness. The method can be refined by using the ranking score of each document in the fitness test. The system has a number of advantages; in particular, the final search queries are easily understood and offer a simple explanation of the clusters, meaning that an extra cluster labelling stage is not required. We describe how the GA can be used to build queries and show results for clustering on various data sets and with different query sizes. Results are also compared with clusters built using the widely used k-means algorithm.
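The fitness rule described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function name cluster_fitness, the use of plain Python sets in place of Lucene query results, and the equal weighting (+1 for a document found in exactly one cluster, -1 for a document found in several) are all assumptions made for illustration. The abstract states only that uniquely assigned documents add to fitness and shared documents reduce it.

```python
from collections import Counter
from typing import Iterable, Set


def cluster_fitness(query_hits: Iterable[Set[str]]) -> int:
    """Score one GA individual from the documents returned by its queries.

    Each element of query_hits is the set of document IDs returned by one
    query, i.e. one candidate cluster. The +1 / -1 weights are an
    illustrative assumption, not taken from the paper.
    """
    # Count how many clusters each document appears in.
    counts = Counter(doc_id for hits in query_hits for doc_id in hits)

    # Documents in exactly one cluster are rewarded.
    unique = sum(1 for c in counts.values() if c == 1)
    # Documents in more than one cluster are penalised.
    overlapping = sum(1 for c in counts.values() if c > 1)

    return unique - overlapping


# Hypothetical example: two queries whose results share document "d3".
example_hits = [{"d1", "d2", "d3"}, {"d3", "d4"}]
example_score = cluster_fitness(example_hits)
```

In this example three documents fall in exactly one cluster and one falls in both, so under the assumed weights the score is 2. The refinement mentioned in the abstract, using each document's ranking score, could replace the unit weights with per-document scores, but how the paper combines them is not stated here.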