Multi-label emotion classification of Urdu tweets

Noman Ashraf; Lal Khan; Sabur Butt; Hsien-Tsung Chang; Grigori Sidorov; Alexander Gelbukh

doi:10.7717/peerj-cs.896

PeerJ Computer Science (Apr 2022)

Affiliations

Noman Ashraf: CIC, Instituto Politécnico Nacional, Mexico City, Mexico
Lal Khan: Department of Computer Science and Information Engineering, Chang Gung University, Taoyuan, Taiwan
Sabur Butt: CIC, Instituto Politécnico Nacional, Mexico City, Mexico
Hsien-Tsung Chang: Department of Computer Science and Information Engineering, Chang Gung University, Taoyuan, Taiwan
Grigori Sidorov: CIC, Instituto Politécnico Nacional, Mexico City, Mexico
Alexander Gelbukh: CIC, Instituto Politécnico Nacional, Mexico City, Mexico

DOI: https://doi.org/10.7717/peerj-cs.896
Journal volume & issue: Vol. 8, p. e896

Abstract

Urdu is widely spoken in South Asia and around the world. While comparable datasets exist for English, we created the first multi-label emotion dataset in the Urdu Nastalíq script, consisting of 6,043 tweets annotated with six basic emotions. We adopted a multi-label (ML) classification approach to detect emotions in Urdu text. The morphological and syntactic structure of Urdu makes multi-label emotion detection challenging. In this paper, we build a set of baseline classifiers: machine learning algorithms (Random Forest (RF), Decision Tree (J48), Sequential Minimal Optimization (SMO), AdaBoostM1, and Bagging), deep learning algorithms (1D Convolutional Neural Network (1D-CNN), Long Short-Term Memory (LSTM), and LSTM with CNN features), and a transformer-based baseline (BERT). We used a combination of text representations: stylometric features, pre-trained word embeddings, word-based n-grams, and character-based n-grams. The paper describes the annotation guidelines, the dataset characteristics, and insights into the different methodologies used for Urdu emotion classification. We report our best results using micro-averaged F1, macro-averaged F1, accuracy, Hamming loss (HL), and exact match (EM) for all tested methods.
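The evaluation measures named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' evaluation code: the function names, the six-column label layout, and the toy label matrices are assumptions made for illustration, and the metrics follow their standard textbook definitions. Multi-label accuracy is left out because the abstract does not say which variant is used.

```python
from typing import List, Tuple

Matrix = List[List[int]]  # rows = tweets, columns = emotion labels (0 or 1)


def hamming_loss(y_true: Matrix, y_pred: Matrix) -> float:
    """Fraction of individual label slots that are predicted incorrectly."""
    total = sum(len(row) for row in y_true)
    wrong = sum(t != p
                for t_row, p_row in zip(y_true, y_pred)
                for t, p in zip(t_row, p_row))
    return wrong / total


def exact_match(y_true: Matrix, y_pred: Matrix) -> float:
    """Fraction of tweets whose whole label set is predicted correctly."""
    hits = sum(t_row == p_row for t_row, p_row in zip(y_true, y_pred))
    return hits / len(y_true)


def label_counts(y_true: Matrix, y_pred: Matrix, j: int) -> Tuple[int, int, int]:
    """True positives, false positives, false negatives for label column j."""
    tp = fp = fn = 0
    for t_row, p_row in zip(y_true, y_pred):
        if t_row[j] and p_row[j]:
            tp += 1
        elif p_row[j]:
            fp += 1
        elif t_row[j]:
            fn += 1
    return tp, fp, fn


def f1_from_counts(tp: int, fp: int, fn: int) -> float:
    """F1 = 2TP / (2TP + FP + FN); defined as 0.0 when the denominator is 0."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def micro_f1(y_true: Matrix, y_pred: Matrix) -> float:
    """Pool the counts over all labels, then compute a single F1."""
    tp = fp = fn = 0
    for j in range(len(y_true[0])):
        a, b, c = label_counts(y_true, y_pred, j)
        tp, fp, fn = tp + a, fp + b, fn + c
    return f1_from_counts(tp, fp, fn)


def macro_f1(y_true: Matrix, y_pred: Matrix) -> float:
    """Compute F1 per label, then take the unweighted mean over labels."""
    n_labels = len(y_true[0])
    scores = [f1_from_counts(*label_counts(y_true, y_pred, j))
              for j in range(n_labels)]
    return sum(scores) / n_labels


# Toy data (hypothetical): 4 tweets, 6 emotion labels each.
y_true = [[1, 0, 0, 0, 0, 0],
          [0, 1, 1, 0, 0, 0],
          [0, 0, 0, 1, 0, 0],
          [1, 0, 0, 0, 0, 1]]
y_pred = [[1, 0, 0, 0, 0, 0],   # all labels correct
          [0, 1, 0, 0, 0, 0],   # misses one label
          [0, 0, 0, 1, 1, 0],   # predicts one extra label
          [1, 0, 0, 0, 0, 1]]   # all labels correct

if __name__ == "__main__":
    print("Hamming loss:", hamming_loss(y_true, y_pred))
    print("Exact match: ", exact_match(y_true, y_pred))
    print("Micro F1:    ", micro_f1(y_true, y_pred))
    print("Macro F1:    ", macro_f1(y_true, y_pred))
```

In this toy case, two labels are never predicted correctly. Macro-averaged F1 therefore falls well below micro-averaged F1, because each label counts equally in the macro average however rarely it occurs.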

Published in PeerJ Computer Science

ISSN: 2376-5992 (Online)
Publisher: PeerJ Inc.
Country of publisher: United States
LCC subjects: Science: Mathematics: Instruments and machines: Electronic computers. Computer science
Website: https://peerj.com/computer-science/
