Sentiment analysis and resources for informal Arabic text on social media

ITANI, Maher (2018). Sentiment analysis and resources for informal Arabic text on social media. Doctoral, Sheffield Hallam University.

Itani_2018_phd_SentimentAnalysisAnd.pdf - Accepted Version
Creative Commons Attribution Non-commercial No Derivatives.

Link to published version: https://doi.org/10.7190/shu-thesis-00118

Abstract

Online content posted by Arab users on social networks generally does not abide by grammatical and spelling rules. These posts, or comments, are valuable because they contain users’ opinions towards different objects such as products, policies, institutions, and people. These opinions constitute important material for commercial and governmental institutions. Commercial institutions can use them to steer marketing campaigns, optimize their products, and identify the weaknesses and/or strengths of those products. Governmental institutions can use social network posts to gauge public opinion before or after legislating a new policy or law and to learn about the main issues that concern citizens. However, the huge size of online data and its noisy nature can hinder manual extraction and classification of the opinions present in online comments. Given the irregularity of dialectal (or informal) Arabic, tools developed for formally correct Arabic are of limited use. This is specifically the case in sentiment analysis (SA), where the target of the analysis is social media content. This research implemented a system that addresses this challenge. The work can be roughly divided into three blocks: building a corpus for SA and manually tagging it to check the performance of the constructed lexicon-based (LB) classifier; building a sentiment lexicon that consists of three different sets of patterns (negative, positive, and spam); and finally implementing a classifier that employs the lexicon to classify Facebook comments. In addition to providing resources for dialectal Arabic SA and classifying Facebook comments, this work categorises the reasons behind incorrect classification, provides preliminary solutions for some of them with a focus on negation, and uses regular expressions to detect the presence of lexemes. This work also illustrates how the constructed classifier works along with its different levels of reporting.
Moreover, it compares the performance of the LB classifier against a Naïve Bayes classifier and addresses how NLP tools such as POS tagging and Named Entity Recognition can be employed in SA. In addition, the work studies the performance of the implemented LB classifier and the developed sentiment lexicon when used to classify other corpora from the literature, and the performance of lexicons from the literature when used to classify the corpora constructed in this research. With minor changes, the classifier can be used for domain classification of documents (sports, science, news, etc.). The work ends with a discussion of research questions arising from the research reported.
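
To make the lexicon-based approach described in the abstract more concrete, the following is a minimal Python sketch of a classifier that uses regular expressions to detect lexemes from three pattern sets (positive, negative, and spam) and applies a simple negation rule. It is not the thesis implementation. The patterns, the negation particles, the `classify` function, the "neutral" fallback label, and the rule that a negator flips only the token immediately after it are all assumptions chosen for illustration.

```python
import re

# Hypothetical, illustrative patterns only; they are NOT taken from the thesis lexicon.
# The trailing "+" tolerates letter elongation, which is common in informal text.
POSITIVE = [re.compile(p) for p in (r"حلو+", r"ممتاز+")]
NEGATIVE = [re.compile(p) for p in (r"سي+ء", r"فاشل+")]
SPAM = [re.compile(p) for p in (r"https?://\S+", r"www\.\S+")]

# Assumed set of negation particles (dialectal and standard).
NEGATORS = {"مش", "مو", "ما", "لا"}


def classify(comment: str) -> str:
    """Return 'spam', 'positive', 'negative', or 'neutral' for a comment."""
    # Spam patterns take priority over sentiment patterns (an assumption).
    if any(p.search(comment) for p in SPAM):
        return "spam"

    score = 0
    previous = ""
    for token in comment.split():
        polarity = 0
        if any(p.fullmatch(token) for p in POSITIVE):
            polarity = 1
        elif any(p.fullmatch(token) for p in NEGATIVE):
            polarity = -1
        # Simplified negation rule: a negator flips the token right after it.
        if previous in NEGATORS:
            polarity = -polarity
        score += polarity
        previous = token

    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


if __name__ == "__main__":
    for text in ("المنتج حلوووو", "مش حلو", "ممتاز www.example.com"):
        print(classify(text))
```

The sketch splits on whitespace and matches whole tokens, so punctuation or clitics attached to a word would prevent a match; a fuller treatment would need normalisation and a wider negation scope.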

Item Type: Thesis (Doctoral)
Additional Information: Director of studies - Chris Roast "No PQ harvesting"
Research Institute, Centre or Group: Sheffield Hallam Doctoral Theses
Identification Number: https://doi.org/10.7190/shu-thesis-00118
Depositing User: Louise Beirne
Date Deposited: 21 Nov 2018 11:25
Last Modified: 05 Dec 2018 09:04
URI: http://shura.shu.ac.uk/id/eprint/23402
