Pakistan Journal of Engineering & Technology (Jan 2024)
Development and Evaluation of Gold Standard Dataset for Sentiment Analysis of Tweets
Abstract
Supervised machine learning typically requires pre-labeled data. Although many open-access, pre-annotated datasets are readily available for training machine learning algorithms, their limited number of target classes makes them unsuitable for certain tasks. Previously available pre-annotated data is usually insufficient for custom models, so most real-world applications require gathering and preparing training data. There is a clear trade-off between the quantity and quality of annotations: the time and resources allocated can either produce more annotated data or guarantee better data quality. Developing a gold standard by annotating textual information is an essential part of the text analytics domain within Natural Language Processing (NLP). In text analytics, annotation can follow a manual, semi-automatic, or automatic approach. In the manual approach, annotators often work with only part of the corpus, and the results are generalized by automated text classification, which may affect the final classification results. Annotation reliability and the suitability of assigned labels are particularly important in NLP applications related to opinion mining and sentiment analysis. In this study, we evaluate the significance of the annotation process on a novel free-text dataset extracted from Twitter that contains multiple languages (English and Roman Urdu); this multilingual character makes careful annotation essential for research on this data. Using this dataset, we examine inter-annotator agreement in multiclass and multi-label sentiment annotation. To scrutinize the reliability of this work, several annotation agreement metrics, statistical analyses, and machine learning methods were used to evaluate the accuracy of the resulting annotations. We observe that annotation is a significant and complex step that is essential for the proper implementation of NLP tasks for text analytics in machine learning. During this research, several gaps that can affect the overall reliability of the annotation process were identified, resolved, and reported in this paper. We conclude that while inaccurate annotations worsen the results, the impact is minimal, at least for text data: the advantages of a larger annotated dataset (obtained through imperfect auto-annotation techniques) outweigh the degradation caused by those imperfect annotations.
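As an illustration of the agreement analysis mentioned above, the minimal sketch below computes Cohen's kappa between two annotators for a multiclass sentiment labeling task. The annotator label lists and class names are invented for the example; the paper does not specify which agreement metrics it applies, so this is only one plausible choice.

# Minimal sketch of an inter-annotator agreement check (assumed example data, not the paper's corpus).
# Requires scikit-learn: pip install scikit-learn
from sklearn.metrics import cohen_kappa_score

# Hypothetical multiclass sentiment labels assigned by two annotators to the same tweets.
annotator_a = ["positive", "negative", "neutral", "positive", "negative", "neutral"]
annotator_b = ["positive", "negative", "neutral", "negative", "negative", "positive"]

# Cohen's kappa corrects raw percent agreement for agreement expected by chance.
kappa = cohen_kappa_score(annotator_a, annotator_b)

# Simple observed (raw) agreement, for comparison.
observed = sum(a == b for a, b in zip(annotator_a, annotator_b)) / len(annotator_a)

print(f"Observed agreement: {observed:.2f}")
print(f"Cohen's kappa:      {kappa:.2f}")

For multi-label annotation, a common approach is to compute such pairwise agreement separately for each label and then average; chance-corrected coefficients for more than two annotators (e.g., Fleiss' kappa or Krippendorff's alpha) are alternatives when more coders are involved.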