A Prescriptive Approach For Structured Information Extraction From Web Forums And Social Media

CUMBERLAND, Ethan and DAY, Tony (2021). A Prescriptive Approach For Structured Information Extraction From Web Forums And Social Media. In: 2021 International Symposium on Computer Science and Intelligent Controls (ISCSIC). IEEE.

Cumberland-PrescriptiveApproachStructured(AM).pdf - Accepted Version
All rights reserved.

Official URL: https://ieeexplore.ieee.org/document/9644296
Link to published version: https://doi.org/10.1109/iscsic54682.2021.00028


In this paper we present ongoing research into extracting highly structured data (such as authors, posts, the links between them, and the metadata about them) from social media and fora using a prescriptive approach, building upon simple observations and generalised rules. This method uses techniques designed around identifying content based on text features, such as text density, and combines them with simple rules derived from studying the common structures of the target web pages to infer and extract structure from structured data. We discuss observations made from studying a number of social media web sites and forums and present the simple rules for post, content and attribute identification developed from these observations. We also present the structured format used to store the extracted data and some of the benefits of this structure. Next, we give initial experimental results, showing that the proposed approach can achieve accuracies above 90% for identifying posts, 70% for extracting content from these posts, and 50-70% for extracting additional attributes about the posts and their authors. We highlight factors influencing these results, before finally detailing the next steps for this research. Our research shows that it is possible to achieve reasonable levels of accuracy for extracting structured data using an approach that requires no training and is transferable between different social media and web forums with no additional input necessary. This approach thus promises considerable efficiency gains compared to the training involved with current machine learning-based approaches, whilst maintaining reasonable performance.
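
The abstract mentions text density as one of the text features used to identify content. The sketch below illustrates one common way such a feature can be computed; it is not taken from the paper. It assumes a simple definition (characters of text inside an element divided by the number of tags inside it), and the names Node, DensityParser, aggregate, text_density and densest are hypothetical helpers introduced only for this illustration.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag (a partial list, for this sketch only).
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}


class Node:
    """A minimal DOM-like node holding the counts needed for text density."""

    def __init__(self, tag, attrs=None, parent=None):
        self.tag = tag
        self.attrs = dict(attrs or [])
        self.parent = parent
        self.children = []
        self.text_chars = 0  # direct text first; subtree total after aggregate()
        self.tag_count = 0   # number of descendant tags after aggregate()


class DensityParser(HTMLParser):
    """Builds a rough element tree from HTML using only the standard library."""

    def __init__(self):
        super().__init__()
        self.root = Node("root")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, attrs, self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:
            self.current = node

    def handle_endtag(self, tag):
        # Climb to the nearest open element with this tag; ignore stray end tags.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent

    def handle_data(self, data):
        self.current.text_chars += len(data.strip())


def aggregate(node):
    """Roll text and tag counts up from the leaves so each node covers its subtree."""
    for child in node.children:
        aggregate(child)
        node.text_chars += child.text_chars
        node.tag_count += 1 + child.tag_count


def text_density(node):
    """Assumed definition: characters of text per tag inside the element."""
    return node.text_chars / max(node.tag_count, 1)


def densest(html, tag="div"):
    """Return the element with the given tag that has the highest text density."""
    parser = DensityParser()
    parser.feed(html)
    parser.close()
    aggregate(parser.root)
    candidates, stack = [], [parser.root]
    while stack:
        node = stack.pop()
        stack.extend(node.children)
        if node.tag == tag:
            candidates.append(node)
    return max(candidates, key=text_density, default=None)


# Made-up forum markup: a link-heavy navigation bar and a text-heavy post body.
sample = (
    '<div class="post">'
    '<div class="nav"><a>Home</a><a>Reply</a><a>Quote</a></div>'
    '<div class="body"><p>This is a fairly long forum post body with plenty of text.</p></div>'
    '</div>'
)
best = densest(sample)
print(best.attrs.get("class"))
```

In this made-up markup the navigation bar contains many tags and little text, so its density is low, while the post body contains one tag and a full sentence. A rule-based extractor in the spirit of the abstract might combine a signal like this with structural rules about where posts and author attributes typically appear.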

Item Type: Book Section
Additional Information: © 2021 IEEE.  Personal use of this material is permitted.  Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.
Identification Number: https://doi.org/10.1109/iscsic54682.2021.00028
SWORD Depositor: Symplectic Elements
Depositing User: Symplectic Elements
Date Deposited: 18 Jan 2022 16:31
Last Modified: 18 Jan 2022 16:45
URI: https://shura.shu.ac.uk/id/eprint/29589
