Annotated dataset for sentiment analysis and sarcasm detection: Bilingual code-mixed English-Malay social media data in the public security domain

Mohd Suhairi Md Suhaimin; Mohd Hanafi Ahmad Hijazi; Ervin Gubin Moung

Data in Brief (Aug 2024)

Affiliations

Mohd Suhairi Md Suhaimin: Data Technology and Applications Research Group, Faculty of Computing and Informatics, Universiti Malaysia Sabah, Kota Kinabalu 88400, Sabah, Malaysia; Polytechnic and Community College Education Department, Galeria PjH Aras 4-7, Jalan P4W Persiaran Perdana, 62100 Putrajaya, Malaysia
Mohd Hanafi Ahmad Hijazi: Data Technology and Applications Research Group, Faculty of Computing and Informatics, Universiti Malaysia Sabah, Kota Kinabalu 88400, Sabah, Malaysia; Creative Advanced Machine Intelligence Research Centre, Faculty of Computing and Informatics, Universiti Malaysia Sabah, Kota Kinabalu 88400, Sabah, Malaysia; Corresponding author.
Ervin Gubin Moung: Data Technology and Applications Research Group, Faculty of Computing and Informatics, Universiti Malaysia Sabah, Kota Kinabalu 88400, Sabah, Malaysia

Journal volume & issue: Vol. 55, p. 110663

Abstract

Sentiment analysis in the public security domain involves analysing public sentiment, emotions, opinions, and attitudes toward events, phenomena, and crises. However, sarcasm, which tends to alter the intended meaning of a text, and bilingual code-mixed content both hamper sentiment analysis systems. Few datasets currently address these issues. This paper introduces a comprehensive dataset constructed through a systematic data acquisition and annotation process. The acquisition process collects data from social media platforms through keyword searching, querying, and scraping, and yields an acquired dataset. The subsequent annotation process refines and labels that data through merging, selection, and annotation, and yields an annotated dataset. Three expert annotators from different fields were appointed for the labelling tasks and determined the sentiment and sarcasm of each item. Additionally, an annotator specialized in literature was appointed to identify the language of each item. This dataset represents a valuable contribution to the field of natural language processing and machine learning, especially within the public security domain and for multilingual countries in Southeast Asia.
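
The abstract says three annotators labelled each item for sentiment and sarcasm but does not state how their labels were combined. As an illustration only, the sketch below assumes a simple majority vote; the record layout, the label values, and the `majority_label` helper are hypothetical and are not taken from the paper or its dataset.

```python
from collections import Counter


def majority_label(labels):
    """Return the label chosen by more than half of the annotators, else None."""
    label, count = Counter(labels).most_common(1)[0]
    return label if count > len(labels) / 2 else None


# Hypothetical records: placeholder text plus one label per annotator.
# The field names and label values are assumptions made for this sketch.
posts = [
    {
        "text": "placeholder post 1",
        "sentiment": ["negative", "negative", "neutral"],
        "sarcasm": [True, True, False],
    },
    {
        "text": "placeholder post 2",
        "sentiment": ["positive", "negative", "neutral"],
        "sarcasm": [False, False, False],
    },
]

# Collapse the three annotations per post into one label per task.
annotated = [
    {
        "text": post["text"],
        "sentiment": majority_label(post["sentiment"]),
        "sarcasm": majority_label(post["sarcasm"]),
    }
    for post in posts
]

for row in annotated:
    print(row)
```

Under this assumed rule, an item on which all three annotators disagree gets `None` and would need separate adjudication; the paper may resolve such cases differently.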

Published in Data in Brief

ISSN: 2352-3409 (Online)
Publisher: Elsevier
Country of publisher: United States
LCC subjects: Medicine: Medicine (General): Computer applications to medicine. Medical informatics; Science: Science (General)
Website: http://www.journals.elsevier.com/data-in-brief/
