ORPHANIDES, Constantinos (2022). Appropriating Data from Structured Sources for Formal Concept Analysis. Doctoral thesis, Sheffield Hallam University.
Abstract
Formal Concept Analysis (FCA) is a principled way of deriving a concept hierarchy from a collection of objects and their associated attributes, building on the mathematical theory of lattices and ordered sets.
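To make this concrete, the sketch below builds a tiny, hypothetical formal context in Python and shows the two derivation operators from which formal concepts arise; it is a textbook illustration, not code from the thesis.

```python
# A tiny, hypothetical formal context: objects mapped to their attributes.
context = {
    "duck":  {"flies", "lays-eggs"},
    "eagle": {"flies", "lays-eggs", "preys"},
    "cat":   {"preys"},
}

def intent(objects):
    """Attributes shared by every object in the set (the ' derivation operator)."""
    sets = [context[o] for o in objects]
    return set.intersection(*sets) if sets else {a for s in context.values() for a in s}

def extent(attributes):
    """Objects that possess every attribute in the set (the dual ' operator)."""
    return {o for o, attrs in context.items() if attributes <= attrs}

# A formal concept is a pair (extent, intent) where each side derives the other:
ext = extent({"flies", "lays-eggs"})
assert ext == {"duck", "eagle"}
assert intent(ext) == {"flies", "lays-eggs"}
```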
To conduct FCA, a dataset must first be appropriated. The data need to be acquired from a data source, and decisions need to be made about how they will be analyzed, such as which objects will be included in the analysis and how each attribute should be interpreted. The data then need to be transformed into a formal context, which can in turn be visualized as a formal concept lattice.
Transforming a formal context into its constituent formal concepts is a process that is well defined and well understood in the literature. The same holds true for converting formal concepts into a formal concept lattice.
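As a sketch of that well-defined process, the following brute-force Python reuses the context and the intent/extent operators from the sketch above to enumerate every formal concept; it is exponential and illustrative only, whereas practical tools use efficient algorithms such as Ganter's NextClosure.

```python
from itertools import combinations

def all_concepts(context):
    """Enumerate every formal concept of a small context: for any object set A,
    the pair (A'', A') is a formal concept. Brute force, fit only for tiny
    contexts; practical tools use algorithms such as NextClosure."""
    concepts = set()
    for r in range(len(context) + 1):
        for combo in combinations(context, r):
            common = intent(set(combo))   # A'
            closed = extent(common)       # A''
            concepts.add((frozenset(closed), frozenset(common)))
    return concepts

for ext_, int_ in sorted(all_concepts(context), key=lambda c: len(c[0])):
    print(sorted(ext_), "|", sorted(int_))
```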
On the other hand, the process of appropriating a dataset into a formal context tends to be ad hoc and bespoke. ToscanaJ can produce formal contexts from a relational database, while ConExp can produce simple formal contexts in a manual fashion. In the CUBIST project, Dau developed a semi-automated, scalingless approach that generates formal contexts out of a triple store by concatenating the object-attribute pairs in the query's result table into corresponding formal attributes, while Orphanides developed an approach that also provided scaling capabilities, albeit again relying on triple-store data. Cubix, the final prototype of the CUBIST project, incorporated the approaches of Dau and Orphanides in an interactive web frontend.
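The concatenation idea can be illustrated in a few lines: each attribute-value pair attached to an object becomes a single formal attribute of the form attribute=value. The rows below are hypothetical stand-ins for a triple store's query results, not CUBIST output.

```python
# Hypothetical (object, attribute, value) rows standing in for triple-store
# query results; each pair is concatenated into one formal attribute.
rows = [
    ("gene1", "stage", "TS7"),
    ("gene1", "tissue", "heart"),
    ("gene2", "stage", "TS9"),
]

formal_context = {}
for obj, attr, value in rows:
    formal_context.setdefault(obj, set()).add(f"{attr}={value}")

print(formal_context)
# {'gene1': {'stage=TS7', 'tissue=heart'}, 'gene2': {'stage=TS9'}}
```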
FcaBedrock is open-source software (OSS) developed as part of this study, employing a series of steps to appropriate data for FCA in a semi-automated, user-driven environment. To underpin this work, we take a case study approach, using two case studies in particular: the UCI Machine Learning (ML) Repository, a dataset repository for the empirical analysis of machine learning algorithms; and the e-Mouse Atlas of Gene Expression (EMAGE), a database of anatomical terms for each Theiler Stage (TS) in mouse development. We compare our approach with existing approaches, using datasets from the two case studies. The appropriation of the datasets becomes an integral part of our evaluation, providing the real-life context and use cases of our proposed approach.
The UCI ML Repository and EMAGE case studies revealed how, prior to this study, a multitude of existing data sources, types and formats were either inaccessible or not easily accessible to FCA; the data appropriation processes were in most cases tedious and time-consuming, often requiring the manual creation of formal contexts out of datasets. In other cases, rigid, inflexible approaches were developed, with hardcoded assumptions made about the underlying use case they were developed for. This is unlike the software and techniques developed in this study, where the same semi-automated steps can consistently facilitate the appropriation of data for FCA from the most common data sources and their underlying data types.
The aim of this study was to discover how effective each FCA approach is at appropriating data for FCA, answering the research question: “How can data from structured sources, consisting of various data types, be acquired and appropriated for FCA in a semi-automated, user-driven environment?”. FcaBedrock emerged as the best appropriation approach, abstracting away the issue of structured sources by using the ubiquitous CSV (Comma-Separated Values) file format as an input source, and providing both automated and semi-automated means of creating meaningful formal contexts from a dataset. Dau's CUBIST scalingless approach, while semi-automated, was restricted to RDF triple stores and provided no flexibility as to how each attribute of the dataset should be interpreted. Orphanides's CUBIST scaleful approach, while providing more flexibility with its scaling capabilities, was again restricted to RDF triple stores. The CUBIST interactive approach improved upon those ideas by allowing the user to drive the analysis via a user-friendly web frontend. ToscanaJ was restricted to using SQL/NoSQL databases as an input source, while both ToscanaJ and ConExp provided no substantial appropriation techniques other than creating a formal context by hand.
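For illustration, the sketch below applies the two kinds of conceptual scaling discussed above, nominal scaling of categorical columns and interval scaling of numeric ones, to a toy CSV input. The column names, cutoff and code are hypothetical and do not reproduce FcaBedrock's implementation.

```python
import csv, io

# Toy CSV input; in practice this would be a file on disk.
data = io.StringIO("animal,legs,habitat\nduck,2,water\ncat,4,land\nspider,8,land\n")

def interval_label(value, cutoffs):
    """Bucket a numeric value into labelled intervals, e.g. <=4 / >4."""
    for cutoff in cutoffs:
        if value <= cutoff:
            return f"<={cutoff}"
    return f">{cutoffs[-1]}"

formal_context = {}
for row in csv.DictReader(data):
    obj = row.pop("animal")
    attrs = set()
    attrs.add("legs" + interval_label(int(row.pop("legs")), [4]))  # interval scaling
    for col, val in row.items():                                   # nominal scaling
        attrs.add(f"{col}={val}")
    formal_context[obj] = attrs

print(formal_context)
# {'duck': {'legs<=4', 'habitat=water'}, 'cat': {'legs<=4', 'habitat=land'},
#  'spider': {'legs>4', 'habitat=land'}}
```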