Predicting Missing Data Values Using Formal Concept Analysis

PAIPETIS, Alexandros (2025). Predicting Missing Data Values Using Formal Concept Analysis. Doctoral, Sheffield Hallam University. [Thesis]

Documents
PDF
Paipetis_2025_PhD_PredictingMissingData.pdf - Accepted Version
Restricted to Repository staff only until 11 September 2026.
Available under License Creative Commons Attribution Non-commercial No Derivatives.

Abstract
The aim of this research is to use Formal Concept Analysis (FCA) to predict missing values in datasets. Missing values pose a significant challenge to the accuracy and reliability of data-driven analysis, affecting workflows and compromising outcomes across various domains. Existing imputation methods often rely on strong statistical assumptions and lack flexibility, applicability, and interpretability. These constraints reduce their effectiveness in real-world scenarios. To address these limitations, this thesis proposes Fault-Tolerant Formal Concept Analysis as a solution to enhance the prediction of missing values. FCA organises object-attribute relations into formal concepts under strict closure conditions. The integration of fault tolerance relaxes these conditions, constructing approximate concepts in which an object or attribute can be included despite a bounded number of missing associations. These approximate concepts capture patterns which can be used to predict missing relations. Building on this approach, this research demonstrates that predicting missing values is indicative of predicting class labels. This perspective enables Fault-Tolerant FCA to be applied in both imputation and classification. To develop this technique, the In-Close algorithm was adapted to incorporate fault tolerance, extending its functionality to these predictive tasks. Two case studies were adopted to validate the proposed technique. The first utilised datasets from the widely recognised UCI Machine Learning Repository. The datasets Mushroom, Adult Census Income, and Nursery were selected due to their diverse characteristics and extensive application in data analytics benchmarking. The second case study employed the Edinburgh Mouse Atlas Gene Expression (EMAGE) database, a specialised biological resource that presents a critical test case due to the subjective assessments involved in its gene expression annotations.
Evaluation of both case studies demonstrated that the proposed technique is practically effective and capable of addressing real-world data challenges. The performance of the proposed technique was rigorously evaluated through a series of experiments. These experiments were designed both to assess the intrinsic effectiveness of the FCA-based approach and to benchmark its performance against established machine learning methods. The results show that the proposed FCA technique achieved high accuracy in predicting missing values, often matching or outperforming traditional methods across various contexts. Although the technique may not recover every missing value in all scenarios, its overall performance remained robust and reliable. This research contributes to the field of FCA by extending its utility beyond concept discovery. Although FCA and fault tolerance have previously been explored for handling uncertainty and noise, their explicit application to missing-value prediction, as undertaken in this study, represents a novel advancement. The empirical findings, combined with FCA’s inherent interpretability, demonstrate its potential for missing value restoration and highlight its broader role in modern data science.
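
The idea of relaxing a strict concept so that an object is admitted despite a bounded number of missing associations can be illustrated with a small sketch. This is not the thesis's adapted In-Close algorithm, and the abstract does not give its exact definitions: the toy context, the function names (intent, extent, tolerant_extent, candidate_missing) and the max_missing parameter are all assumptions made for illustration. The sketch relaxes only the object side; a fuller treatment would presumably bound missing associations per attribute as well.

```python
from typing import Dict, Set

# Hypothetical toy context: each object maps to the attributes it is known to have.
Context = Dict[str, Set[str]]


def intent(ctx: Context, objects: Set[str]) -> Set[str]:
    """Strict derivation: attributes shared by every object in `objects`."""
    if not objects:
        # By convention, the empty object set shares every attribute.
        return set().union(*ctx.values())
    return set.intersection(*(ctx[g] for g in objects))


def extent(ctx: Context, attrs: Set[str]) -> Set[str]:
    """Strict derivation: objects that have every attribute in `attrs`."""
    return {g for g, has in ctx.items() if attrs <= has}


def tolerant_extent(ctx: Context, attrs: Set[str], max_missing: int) -> Set[str]:
    """Relaxed derivation (assumed form): objects lacking at most
    `max_missing` of the attributes in `attrs`."""
    return {g for g, has in ctx.items() if len(attrs - has) <= max_missing}


def candidate_missing(ctx: Context, attrs: Set[str], max_missing: int) -> Dict[str, Set[str]]:
    """Objects admitted only by the relaxation, each with the attributes it
    lacks. These gaps are the candidate missing relations."""
    strict = extent(ctx, attrs)
    relaxed = tolerant_extent(ctx, attrs, max_missing)
    return {g: attrs - ctx[g] for g in relaxed - strict}


# Illustrative data only; not taken from the thesis or its case studies.
ctx: Context = {
    "o1": {"a", "b", "c"},
    "o2": {"a", "b", "c"},
    "o3": {"a", "b"},  # the relation (o3, c) may be missing
    "o4": {"d"},
}
attrs = {"a", "b", "c"}

print(sorted(extent(ctx, attrs)))              # ['o1', 'o2']
print(sorted(tolerant_extent(ctx, attrs, 1)))  # ['o1', 'o2', 'o3']
print(candidate_missing(ctx, attrs, 1))        # {'o3': {'c'}}
```

Here ({o1, o2}, {a, b, c}) is a strict formal concept, since each side is the derivation of the other. Allowing one missing association admits o3 into an approximate concept, and the gap (o3, c) is the relation such a scheme would propose to fill. Treating a class label as one more attribute gives the same mechanism a classification reading, which matches the abstract's observation that the two tasks are closely related.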