A Prescriptive Approach For Structured Information Extraction From Web Forums And Social Media

CUMBERLAND, Ethan and DAY, Tony (2021). A Prescriptive Approach For Structured Information Extraction From Web Forums And Social Media. In: 2021 International Symposium on Computer Science and Intelligent Controls (ISCSIC). IEEE. [Book Section]

Cumberland-PrescriptiveApproachStructured(AM).pdf - Accepted Version
Available under License All rights reserved.

Abstract
In this paper we present ongoing research into extracting highly structured data - such as authors, posts, the links between them, and the metadata about them - from social media and web forums using a prescriptive approach that builds upon simple observations and generalised rules. This method uses techniques designed to identify content based on text features, such as text density, and combines them with simple rules derived from studying the common structures of the target web pages to infer and extract structure from unstructured data. We discuss observations made from studying a number of social media websites and forums and present the simple rules for post, content and attribute identification developed from these observations. We also present the structured format used to store the extracted data and some of the benefits of this structure. Next, we give initial experimental results, showing that the proposed approach can achieve accuracies above 90% for identifying posts, 70% for extracting content from these posts, and 50-70% for extracting additional attributes about the posts and their authors. We highlight factors influencing these results, before finally detailing the next steps for this research. Our research shows that it is possible to achieve reasonable levels of accuracy for extracting structured data using an approach that requires no training and is transferable between different social media sites and web forums with no additional input. This approach thus promises considerable efficiency gains over current machine learning-based approaches, which require training, whilst maintaining reasonable performance.
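The abstract names text density as one of the text features used to identify content but does not define it. As a rough illustration only, the sketch below assumes a common generic definition (characters of text per element in a subtree) and picks the densest element of a page. The names `Node`, `TreeBuilder` and `densest_element` are hypothetical, and this is not the rule set described in the paper.

```python
from html.parser import HTMLParser


class Node:
    """Minimal element node: tag name, children, and directly owned text."""

    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []
        self.text = ""

    def text_length(self):
        # Characters of text in this subtree, ignoring surrounding whitespace.
        return len(self.text.strip()) + sum(c.text_length() for c in self.children)

    def tag_count(self):
        # Number of elements in this subtree, counting the node itself.
        return 1 + sum(c.tag_count() for c in self.children)

    def density(self):
        # Assumed definition: text characters per element in the subtree.
        return self.text_length() / self.tag_count()

    def walk(self):
        yield self
        for child in self.children:
            yield from child.walk()


class TreeBuilder(HTMLParser):
    """Builds a simple element tree from HTML using only the standard library."""

    VOID = {"br", "img", "hr", "input", "meta", "link"}

    def __init__(self):
        super().__init__()
        self.root = Node("root")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current)
        self.current.children.append(node)
        if tag not in self.VOID:
            self.current = node

    def handle_endtag(self, tag):
        # Close the nearest open element with this tag; ignore stray end tags.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent

    def handle_data(self, data):
        self.current.text += data


def densest_element(html):
    """Return the element whose subtree has the highest text density."""
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    candidates = [n for n in builder.root.walk() if n is not builder.root]
    return max(candidates, key=lambda n: n.density())
```

On a forum page, an element holding a long run of post text with few nested tags would score higher than navigation or metadata blocks, which tend to contain many tags and little text. A density score alone does not separate posts from one another, which is presumably where structural rules of the kind the abstract mentions would come in.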