Journal of Applied Informatics and Computing (Dec 2023)

Clustering Balinese Language Documents using the Balinese Stemmer Method and Mini Batch K-Means with K-Means++

  • Made Agus Putra Subali,
  • I Gusti Rai Agung Sugiartha,
  • Komang Budiarta,
  • I Made Budi Adnyana

DOI
https://doi.org/10.30871/jaic.v7i2.6476
Journal volume & issue
Vol. 7, no. 2
pp. 258 – 262

Abstract

Clustering aims to partition data into n groups such that items within each group are as similar as possible while the similarity between groups is minimized. Among the various clustering methods, k-means is widely used because of its simplicity and its ability to yield good clustering results. However, k-means is slow on high-dimensional datasets, and its results are sensitive to the initial choice of cluster centers. To address these limitations, this study uses the mini-batch k-means method to speed up processing of high-dimensional data and the k-means++ method to select the initial cluster centers. The dataset comprises 300 Balinese-language news articles sourced from the https://balitv.tv/ website. Before clustering, the documents are stemmed with the Balinese stemmer method to improve recall. The clustering results show that most of the 300 documents are highly similar to one another. When the number of clusters (n) exceeds two, the documents are not distinctly separated because of their high structural similarity. This can be attributed to the relatively small number of words, or attributes, produced. In future research, feature reduction will be applied, and a clustering method capable of handling overlapping data will be explored.
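
The two techniques named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names (`sq_dist`, `kmeans_pp_init`, `mini_batch_kmeans`), the parameter values, and the toy term-count vectors are all assumptions made for illustration, and the Balinese stemming step is not reproduced (the documents are assumed to be already stemmed and vectorized). The seeding step samples each new center with probability proportional to its squared distance from the nearest existing center, and the update step moves a center toward each sampled document with a per-center learning rate of 1/count.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans_pp_init(points, k, rng):
    """k-means++ seeding: pick the first center uniformly at random, then
    sample each further center with probability proportional to its squared
    distance from the nearest center chosen so far."""
    centers = [list(rng.choice(points))]
    while len(centers) < k:
        d2 = [min(sq_dist(p, c) for c in centers) for p in points]
        if sum(d2) == 0:
            # Every point coincides with a center; fall back to uniform choice.
            centers.append(list(rng.choice(points)))
        else:
            centers.append(list(rng.choices(points, weights=d2, k=1)[0]))
    return centers


def mini_batch_kmeans(points, k, batch_size=4, n_iter=50, seed=0):
    """Mini-batch k-means: update centers from small random batches
    instead of the full dataset on every iteration."""
    rng = random.Random(seed)
    centers = kmeans_pp_init(points, k, rng)
    counts = [0] * k  # number of points each center has absorbed so far
    for _ in range(n_iter):
        batch = rng.sample(points, min(batch_size, len(points)))
        # Assign the whole batch first, then apply the updates.
        nearest = [min(range(k), key=lambda j: sq_dist(p, centers[j]))
                   for p in batch]
        for p, j in zip(batch, nearest):
            counts[j] += 1
            eta = 1.0 / counts[j]  # per-center learning rate
            centers[j] = [(1 - eta) * c + eta * x
                          for c, x in zip(centers[j], p)]
    labels = [min(range(k), key=lambda j: sq_dist(p, centers[j]))
              for p in points]
    return centers, labels


# Toy term-count vectors (hypothetical data, not taken from the study).
docs = [
    [3, 2, 0, 0], [2, 3, 0, 0], [3, 3, 1, 0],
    [0, 0, 3, 2], [0, 1, 2, 3], [0, 0, 3, 3],
]
centers, labels = mini_batch_kmeans(docs, k=2)
print(labels)
```

In practice a library implementation, with TF-IDF weighting of the stemmed text, would likely be preferred over this hand-written version; the sketch only makes the seeding and batch-update steps explicit.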

Keywords