Appropriating Data from Structured Sources for Formal Concept Analysis

ORPHANIDES, Constantinos (2022). Appropriating Data from Structured Sources for Formal Concept Analysis. Doctoral, Sheffield Hallam University.

Orphanides_2022_PhD_AppropriatingDataFrom.pdf (Accepted Version, PDF, 18MB)
Licence: Creative Commons Attribution Non-commercial No Derivatives.
Link to published version: https://doi.org/10.7190/shu-thesis-00503

Abstract

Formal Concept Analysis (FCA) is a principled way of deriving a concept hierarchy from a collection of objects and their associated attributes, building on the mathematical theory of lattices and ordered sets. Before FCA can be conducted, however, a dataset must first be appropriated. The data need to be acquired from a data source, and decisions need to be made about how they will be analyzed, such as which objects to include in the analysis and how each attribute should be interpreted. The data must then be transformed into a formal context, which can in turn be visualized as a formal concept lattice. Transforming a formal context into its constituent formal concepts is a process that is well defined and well understood in the literature, as is converting formal concepts into a formal concept lattice. The process of appropriating a dataset into a formal context, on the other hand, tends to be ad-hoc and bespoke.

ToscanaJ can produce formal contexts from a relational database, while ConExp can produce simple formal contexts in a manual fashion. In the CUBIST project, Dau developed a semi-automated, scalingless approach that generates formal contexts from a triple store by concatenating the object-attribute pairs of the resulting table into their corresponding formal attributes, while Orphanides developed an approach that added scaling capabilities, albeit again relying on triple-store data. Cubix, the final prototype of the CUBIST project, incorporated the approaches of Dau and Orphanides in an interactive web frontend.

FcaBedrock is Open-source software (OSS) developed as part of this study, employing a series of steps to appropriate data for FCA in a semi-automated, user-driven environment. To underpin this work, we take a case study approach, using two case studies in particular: the UCI Machine Learning (ML) Repository, a dataset repository for the empirical analysis of machine learning algorithms, and the e-Mouse Atlas of Gene Expression (EMAGE), a database of anatomical terms for each Theiler Stage (TS) in mouse development. We compare our approach with existing approaches, using datasets from the two case studies. The appropriation of these datasets becomes an integral part of our evaluation, providing the real-life context and use-cases for our proposed approach.

The UCI ML Repository and EMAGE case studies revealed how, prior to this study, a multitude of existing data sources, types and formats were either inaccessible or not easily accessible to FCA; the data appropriation processes were in most cases tedious and time-consuming, often requiring the manual creation of formal contexts from datasets. In other cases, rigid, inflexible approaches were developed, with hardcoded assumptions about the underlying use-case they were developed for. The software and techniques developed in this study, by contrast, apply the same semi-automated steps to consistently facilitate the appropriation of data for FCA from the most common data sources and their underlying data types. The aim of this study was to discover how effective each FCA approach is in appropriating data for FCA, answering the research question: “How can data from structured sources, consisting of various data types, be acquired and appropriated for FCA in a semi-automated, user-driven environment?”.
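For readers unfamiliar with the formalism, the sketch below illustrates the well-understood step the abstract refers to: deriving every formal concept from a formal context by brute-force closure. The toy context and its object and attribute names are invented for illustration and do not come from the thesis.

```python
from itertools import combinations

# Toy formal context: objects mapped to their attribute sets.
# These objects and attributes are invented for illustration.
context = {
    "duck":   {"swims", "flies", "lays eggs"},
    "eagle":  {"flies", "lays eggs"},
    "salmon": {"swims", "lays eggs"},
}

ALL_ATTRS = set().union(*context.values())

def intent(objects):
    """A': the attributes shared by every object in the set."""
    return set.intersection(*(context[o] for o in objects)) if objects else set(ALL_ATTRS)

def extent(attrs):
    """B': the objects that possess every attribute in the set."""
    return {o for o, a in context.items() if attrs <= a}

# Brute force: close every subset of objects. A pair (A, B) is a
# formal concept when B = A' and A = B''. Fine for toy contexts only;
# real implementations use algorithms such as NextClosure.
concepts = set()
for r in range(len(context) + 1):
    for combo in combinations(context, r):
        b = intent(set(combo))
        a = extent(b)
        concepts.add((frozenset(a), frozenset(b)))

for a, b in sorted(concepts, key=lambda c: len(c[0])):
    print(sorted(a), "->", sorted(b))
```

Each printed pair is a formal concept: a maximal set of objects together with the attributes they all share. Ordering these pairs by extent inclusion yields the formal concept lattice.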
FcaBedrock emerged as the best appropriation approach, abstracting away the issue of structured sources by using the ubiquitous CSV (Comma-Separated Values) file format as an input source, and by providing both automated and semi-automated means of creating meaningful formal contexts from a dataset. Dau’s CUBIST scalingless approach, while semi-automated, was restricted to RDF triple stores and provided no flexibility in how each attribute of the dataset should be interpreted. Orphanides’s CUBIST scaleful approach, while more flexible thanks to its scaling capabilities, was again restricted to RDF triple stores. The CUBIST interactive approach improved upon those ideas by allowing the user to drive the analysis via a user-friendly web frontend. ToscanaJ was restricted to SQL/NoSQL databases as an input source, while neither ToscanaJ nor ConExp provided any substantial appropriation technique beyond creating a formal context by hand.
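As a rough illustration of the kind of CSV-driven appropriation described above, the sketch below shows nominal and interval scaling turning a many-valued CSV table into a binary formal context. This is not FcaBedrock’s actual code; the column names, values and bin boundaries are hypothetical.

```python
import csv
import io

# Illustrative CSV rows in the spirit of a UCI ML dataset; the column
# names and values are hypothetical, not taken from the thesis.
raw = """name,species,age
rex,dog,2
tom,cat,7
ben,dog,11
"""

rows = list(csv.DictReader(io.StringIO(raw)))

# Interval scaling: user-chosen boolean ranges for a numeric column.
age_bins = [
    ("age<5",     lambda v: v < 5),
    ("5<=age<10", lambda v: 5 <= v < 10),
    ("age>=10",   lambda v: v >= 10),
]

formal_context = {}
for row in rows:
    attrs = {f"species={row['species']}"}                     # nominal scaling
    age = int(row["age"])
    attrs |= {name for name, test in age_bins if test(age)}   # interval scaling
    formal_context[row["name"]] = attrs

for obj, attrs in sorted(formal_context.items()):
    print(obj, sorted(attrs))
```

Here each many-valued column becomes a set of binary formal attributes: categorical values via nominal scaling, numeric values via user-chosen interval scaling, mirroring the user-driven interpretation decisions the abstract describes.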

Item Type: Thesis (Doctoral)
Contributors:
Thesis advisor - Andrews, Simon
Additional Information: Director of Studies: Professor Simon Andrews
Research Institute, Centre or Group: Sheffield Hallam Doctoral Theses
Identification Number: https://doi.org/10.7190/shu-thesis-00503
Depositing User: Justine Gavin
Date Deposited: 16 Feb 2023 12:52
Last Modified: 31 Oct 2023 01:18
URI: https://shura.shu.ac.uk/id/eprint/31513
