Document clustering with evolved search queries

HIRSCH, Laurence and DI NUOVO, Alessandro (2017). Document clustering with evolved search queries. In: 2017 IEEE Congress on Evolutionary Computation (CEC), Proceedings : 5-8 June 2017, Donostia - San Sebastián, Spain. Piscataway, NJ, IEEE, 1239-1246.

Hirsch_DiNuovo_TextClusteringGA.pdf - Accepted Version
All rights reserved.



Search queries define a set of documents located in a collection and can be used to rank the documents by assigning each document a score according to its closeness to the query in the multidimensional space of weighted terms. In this paper, we describe a system whereby an island model genetic algorithm (GA) creates individuals that can generate a set of Apache Lucene search queries for the purpose of text document clustering. A cluster is specified by the documents returned by a single query in the set. Each document that is included in only one of the clusters adds to the fitness of the individual, and each document that is included in more than one cluster reduces it. The method can be refined by using the ranking score of each document in the fitness test. The system has a number of advantages; in particular, the final search queries are easily understood and offer a simple explanation of the clusters, meaning that an extra cluster labelling stage is not required. We describe how the GA can be used to build queries and show results for clustering on various data sets and with different query sizes. Results are also compared with clusters built using the widely used k-means algorithm.
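The fitness rule described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: it replaces Apache Lucene with in-memory set matching, treats each query as a set of terms with OR semantics, and uses a reward of +1 and a penalty of -1 per document. All of these are assumptions made for illustration, as are the names `matches`, `fitness`, `docs` and `queries`.

```python
def matches(query_terms, doc_tokens):
    """Return True if the document contains any query term.

    OR semantics are an assumption; the paper builds real Lucene queries.
    """
    return any(term in doc_tokens for term in query_terms)


def fitness(queries, docs):
    """Score one individual (a set of queries, one query per cluster).

    Assumed weights: +1 for each document returned by exactly one query,
    -1 for each document returned by more than one query.
    Documents returned by no query do not change the score.
    """
    score = 0
    for tokens in docs.values():
        hit_count = sum(matches(q, tokens) for q in queries)
        if hit_count == 1:
            score += 1
        elif hit_count > 1:
            score -= 1
    return score


# Hypothetical toy collection: document id -> set of tokens.
docs = {
    "d1": {"engine", "car"},
    "d2": {"orbit", "rocket"},
    "d3": {"car", "rocket"},
    "d4": {"recipe"},
}

# One individual encoding two single-term queries.
queries = [{"car"}, {"rocket"}]

print(fitness(queries, docs))  # prints 1: d1 and d2 add, d3 subtracts, d4 is ignored
```

The refinement mentioned in the abstract, using the ranking score of each document, could be approximated by adding or subtracting a per-document relevance score in place of the fixed unit weights; that variant is not shown here.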

Item Type: Book Section
Additional Information: Paper originally presented at the IEEE Congress on Evolutionary Computation 2017, Donostia - San Sebastián, Spain. IEEE Catalog Number: CFP17ICE-ART
Uncontrolled Keywords: text clustering genetic algorithm
Departments - Does NOT include content added after October 2018: Faculty of Science, Technology and Arts > Department of Computing
Page Range: 1239-1246
Depositing User: Laurence Hirsch
Date Deposited: 12 Apr 2017 14:46
Last Modified: 18 Mar 2021 15:34
