Re-Clustering Documents to Enhance Search Accuracy with Imbalanced Abbreviation Data

Woon-Kyo Lee; Ja-Hee Kim

doi:10.17559/tv-20231214001207

Tehnički Vjesnik (Jan 2024)

Affiliations

Woon-Kyo Lee: Seoul National University of Science & Technology, Graduate School of Public Policy and Information Technology, 232 Gongneung-ro, Nowon-gu, Seoul, Korea
Ja-Hee Kim: Seoul National University of Science & Technology, Graduate School of Public Policy and Information Technology, 232 Gongneung-ro, Nowon-gu, Seoul, Korea

DOI: https://doi.org/10.17559/tv-20231214001207
Journal volume & issue: Vol. 31, no. 6
pp. 1845 – 1858

Abstract

Abbreviation ambiguity poses significant challenges when searching academic literature. This study evaluated the accuracy of clustering algorithms on imbalanced datasets with varying ratios of target groups. The corpus consisted of 1052 papers collected for the study of abbreviations. The "MSA" dataset was clustered using TF-IDF, cosine similarity, and k-means. Clustering performance declined as the target-group ratio deviated from balanced thresholds. A re-clustering method was introduced that selectively excludes non-target clusters. Re-clustering improved accuracy and F1 scores in most scenarios and was particularly stable at higher cluster counts. In comparisons, re-clustering outperformed both k-means and self-adaptive methods. The study highlights issues stemming from data imbalance and presents an effective strategy for enhancing abbreviation search efficiency.
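
The pipeline named in the abstract (TF-IDF weighting, cosine similarity, k-means, then exclusion of non-target clusters followed by another clustering pass) might be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the toy corpus `DOCS`, the helper names, the farthest-first initialisation, the use of a single known target document (`seed_idx`), and the similarity cut-off `threshold` are all hypothetical choices made for the example.

```python
import math
from collections import Counter, defaultdict

# Hypothetical toy corpus: three senses of "MSA", two documents each.
DOCS = [
    "msa sequence alignment protein",
    "msa sequence alignment genome",
    "msa gauge measurement quality",
    "msa gauge measurement repeatability",
    "msa atrophy neurology patient",
    "msa atrophy neurology tremor",
]

def tfidf(docs):
    """Return unit-length TF-IDF vectors as dicts (term -> weight)."""
    tokenised = [d.lower().split() for d in docs]
    df = Counter(t for toks in tokenised for t in set(toks))
    n = len(docs)
    vectors = []
    for toks in tokenised:
        # Smoothed IDF: ln((1 + n) / (1 + df)) + 1
        vec = {t: c * (math.log((1 + n) / (1 + df[t])) + 1)
               for t, c in Counter(toks).items()}
        norm = math.sqrt(sum(w * w for w in vec.values())) or 1.0
        vectors.append({t: w / norm for t, w in vec.items()})
    return vectors

def cosine(a, b):
    """Cosine similarity of two unit-length sparse vectors."""
    if len(a) > len(b):
        a, b = b, a
    return sum(w * b.get(t, 0.0) for t, w in a.items())

def centroid(vectors):
    """Unit-length mean direction of a group of vectors."""
    total = defaultdict(float)
    for v in vectors:
        for t, w in v.items():
            total[t] += w
    norm = math.sqrt(sum(w * w for w in total.values())) or 1.0
    return {t: w / norm for t, w in total.items()}

def kmeans(vectors, k, first=0, iters=20):
    """k-means on cosine similarity with deterministic farthest-first seeding."""
    centres = [vectors[first]]
    while len(centres) < k:
        far = min(vectors, key=lambda v: max(cosine(v, c) for c in centres))
        centres.append(far)
    labels = None
    for _ in range(iters):
        new = [max(range(k), key=lambda c: cosine(v, centres[c])) for v in vectors]
        if new == labels:  # assignments stable: converged
            break
        labels = new
        for c in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:
                centres[c] = centroid(members)
    return labels

def recluster(vectors, seed_idx, k, threshold=0.3, max_rounds=5):
    """Cluster, drop clusters dissimilar to the seed document, and repeat.

    Assumption: a cluster counts as "non-target" when its centroid's cosine
    similarity to the known target document falls below `threshold`.
    """
    active = list(range(len(vectors)))
    for _ in range(max_rounds):
        sub = [vectors[i] for i in active]
        labels = kmeans(sub, min(k, len(sub)), first=active.index(seed_idx))
        keep = []
        for c in sorted(set(labels)):
            members = [i for i, lab in zip(active, labels) if lab == c]
            centre = centroid([vectors[i] for i in members])
            if seed_idx in members or cosine(vectors[seed_idx], centre) >= threshold:
                keep.extend(members)
        keep.sort()
        if keep == active:  # nothing was excluded in this round
            break
        active = keep
    return active

vectors = tfidf(DOCS)
kept = recluster(vectors, seed_idx=0, k=3)
print([DOCS[i] for i in kept])
```

On this toy corpus the first pass separates the three senses, and only the two sequence-alignment documents survive the exclusion step. With real, imbalanced data the cut-off and the number of clusters would need tuning, and the abstract does not state how the study identified non-target clusters.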

Keywords

imbalanced data, k-means algorithm, re-clustering, word sense disambiguation

Published in Tehnički Vjesnik

ISSN: 1330-3651 (Print); 1848-6339 (Online)
Publisher: Faculty of Mechanical Engineering in Slavonski Brod, Faculty of Electrical Engineering in Osijek, Faculty of Civil Engineering in Osijek
Country of publisher: Croatia
LCC subjects: Technology: Engineering (General). Civil engineering (General)
Website: http://hrcak.srce.hr/tehnicki-vjesnik
