Document Clustering with Evolved Single Word Search Queries

HIRSCH, Laurence, DI NUOVO, Alessandro and PRASANNA, Haddela (2021). Document Clustering with Evolved Single Word Search Queries. In: 2021 IEEE Congress on Evolutionary Computation (CEC). IEEE.

2021076864.pdf - Accepted Version
All rights reserved.

Official URL: https://ieeexplore.ieee.org/document/9504770
Link to published version: https://doi.org/10.1109/CEC45853.2021.9504770

Abstract

We present a novel, hybrid approach for clustering text databases. We use a genetic algorithm to generate and evolve a set of single-word search queries in Apache Lucene format. Each cluster is formed as the set of documents matching one search query. The queries are optimized to maximize the number of documents returned and to minimize the overlap between clusters (documents returned by more than one query in a set). Optionally, the number of clusters can be specified in advance, which normally improves performance. Some documents in a collection are not returned by any of the search queries in a set, so once the search query evolution is complete a second stage applies a k-nearest-neighbours (KNN) algorithm to assign every unassigned document to its nearest cluster. We describe the method and compare its effectiveness with that of other well-known systems on 8 different text datasets. We note that the search query format has the qualitative benefits of being interpretable and of providing an explanation of cluster construction.
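
The two stages described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not give the exact fitness formula, so `fitness` below assumes a simple score (documents returned by exactly one query minus documents returned by several). Documents are modelled as plain sets of words rather than a Lucene index, Jaccard similarity stands in for whatever distance the paper's KNN stage uses, and the genetic search itself is omitted. All function names and the toy data are hypothetical.

```python
from collections import Counter

def match(word, docs):
    """Indices of documents that contain the single-word query."""
    return {i for i, d in enumerate(docs) if word in d}

def fitness(queries, docs):
    """Assumed score: documents hit by exactly one query minus documents hit by several."""
    hits = Counter(i for q in queries for i in match(q, docs))
    unique = sum(1 for c in hits.values() if c == 1)
    overlap = sum(1 for c in hits.values() if c > 1)
    return unique - overlap

def jaccard(a, b):
    """Jaccard similarity of two word sets (an assumed similarity measure)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def assign_with_knn(queries, docs, k=3):
    """Stage 2: label each unmatched document by majority vote of its
    k most similar documents among those matched by a query."""
    labels = {}
    for cluster, q in enumerate(queries):
        for i in match(q, docs):
            labels.setdefault(i, cluster)  # simplification: overlaps keep the first cluster
    labelled = list(labels.items())  # snapshot: only query-matched documents vote
    if not labelled:
        return labels
    for i, d in enumerate(docs):
        if i not in labels:
            nearest = sorted(labelled,
                             key=lambda item: jaccard(d, docs[item[0]]),
                             reverse=True)[:k]
            votes = Counter(cluster for _, cluster in nearest)
            labels[i] = votes.most_common(1)[0][0]
    return labels

# Toy collection (hypothetical data): two topics, six documents.
docs = [
    {"goal", "match", "team"},
    {"goal", "league", "striker"},
    {"stock", "market", "shares"},
    {"stock", "bank", "profit"},
    {"team", "league", "striker"},   # matched by neither query below
    {"bank", "shares", "profit"},    # matched by neither query below
]
queries = ["goal", "stock"]
score = fitness(queries, docs)
clusters = assign_with_knn(queries, docs, k=3)
```

In this toy collection the query set `["goal", "stock"]` scores higher than `["goal", "team"]` because the latter returns one document twice. A genetic algorithm would search over such query sets using a score of this kind, and each cluster stays explainable by the single word that defines it.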

Item Type: Book Section
Additional Information: © 2021 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.
Identification Number: https://doi.org/10.1109/CEC45853.2021.9504770
SWORD Depositor: Symplectic Elements
Depositing User: Symplectic Elements
Date Deposited: 26 Apr 2021 15:59
Last Modified: 16 Sep 2021 15:15
URI: https://shura.shu.ac.uk/id/eprint/28567
