Statistika (Jan 2022)

Classification of Public Opinion on Social Media Twitter concerning the Education in Indonesia Using the K-Nearest Neighbors (K-NN) Algorithm and K-Fold Cross Validation

  • Intan Monica Hanmastiana,
  • Budi Warsito,
  • Rita Rahmawati,
  • Hasbi Yasin,
  • Puspita Kartikasari

DOI
https://doi.org/10.29313/statistika.v21i2.297
Journal volume & issue
Vol. 21, no. 2
pp. 99 – 106

Abstract

A developing country is one whose outlook and ideas reflect an awareness of the importance of advancing its education sector. Public assessments of the quality of education in Indonesia vary, and people respond to it in different ways. These responses are often found on social media, including Twitter, a popular service that people use to interact and communicate in daily life. Sentiment analysis of Twitter data is therefore one way to gauge public responses to the state of education in Indonesia. The responses are classified into positive and negative sentiments using the K-Nearest Neighbors (K-NN) algorithm, with the model evaluated by 10-fold cross validation. K-NN has several advantages: training is fast, the method is simple and easy to learn, it is robust to noisy training data, and it is effective when the training data is large. In this study, sentiment classification uses cosine similarity as the distance measure and four values of the parameter k: 3, 5, 7, and 9. Data are labelled in two ways, manually and by sentiment scoring. Positive and negative sentiments are visualized with word clouds. The test results show that public sentiment about education on Twitter tends to be positive, and that k = 7 gives the highest accuracy under both labelling methods: 76.93% with manual labelling and 77.87% with sentiment scoring. The analysis was carried out in the R programming language, with RStudio as the supporting software.
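The classification procedure summarized above can be illustrated with a minimal sketch. It is written in Python rather than the R used in the study, and it is not the authors' code. The example texts, the bag-of-words weighting, the fold assignment, and all function names are assumptions made for illustration only.

```python
import math
from collections import Counter

def vectorize(text):
    """Bag-of-words term-frequency vector (an assumed weighting scheme)."""
    return Counter(text.lower().split())

def cosine_similarity(a, b):
    """Cosine of the angle between two sparse term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def knn_predict(train, query, k):
    """Majority vote among the k training items most similar to the query."""
    ranked = sorted(train, key=lambda item: cosine_similarity(item[0], query),
                    reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

def k_fold_accuracy(data, k, n_folds=10):
    """Fold i holds out items i, i + n_folds, ...; the rest are training data."""
    correct = 0
    for fold in range(n_folds):
        test = data[fold::n_folds]
        train = [item for i, item in enumerate(data) if i % n_folds != fold]
        correct += sum(knn_predict(train, vec, k) == label for vec, label in test)
    return correct / len(data)

# Hypothetical toy texts, not data from the study.
positive = ["good teachers help students", "schools are improving",
            "great education program", "students love learning",
            "good school good teachers", "education is improving well",
            "great teachers great students", "love the new program",
            "learning is good", "students are improving"]
negative = ["bad school facilities", "education costs are bad",
            "poor teachers poor results", "students hate exams",
            "schools lack funding", "bad program bad results",
            "poor facilities in schools", "exams are too hard",
            "costs are too high", "teachers lack support"]

data = []
for pos, neg in zip(positive, negative):
    data.append((vectorize(pos), "positive"))
    data.append((vectorize(neg), "negative"))

# Compare the four k values named in the abstract.
for k in (3, 5, 7, 9):
    print(f"k = {k}: accuracy = {k_fold_accuracy(data, k):.2f}")
```

The sketch ranks neighbors by descending similarity, which is equivalent to ranking by ascending cosine distance. The accuracies it prints describe only the toy texts and have no bearing on the figures reported in the study.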

Keywords