Proceedings of the XXth Conference of Open Innovations Association FRUCT (Nov 2024)
Enhanced Multi-Label Question Tagging on Stack Overflow: A Two-Stage Clustering and DeBERTa-Based Approach
Abstract
This paper introduces a novel method for automatic multi-label classification of questions, using data sourced from Stack Overflow. Traditional tagging methods struggle with the complexity and semantic diversity of these questions, which leads to inconsistent and sometimes inaccurate tags. Our pipeline begins with preprocessing to remove unwanted elements from the question text. We then encode each question as a semantic vector using SMPNet and apply UMAP to reduce the dimensionality of these vectors, which exposes the overall structure of the data and makes similar questions easier to cluster. The reduced vectors are grouped with K-Means, with the number of clusters chosen by the Silhouette Score. Finally, a separate DeBERTa model is fine-tuned on each cluster to predict the appropriate tags. Our approach significantly outperforms traditional methods, achieving a 2% improvement over the best baseline. Restricting each model to a specific subset of the data also improves model efficiency.
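The cluster-count selection step can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes a handful of toy 2-D points standing in for UMAP-reduced question embeddings, a plain Lloyd's K-Means with a deterministic farthest-point initialization (chosen here only to make the example reproducible), and hypothetical function names (`farthest_point_init`, `kmeans`, `silhouette_score`, `best_k`). In practice, library implementations of UMAP, K-Means, and the Silhouette Score would take their place.

```python
import math

def farthest_point_init(points, k):
    """Deterministic init (an assumption of this sketch): start from the
    first point, then repeatedly add the point farthest from all centers."""
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(math.dist(p, c) for c in centers)))
    return centers

def kmeans(points, k, iters=50):
    """Plain Lloyd's K-Means; returns one cluster label per point."""
    centers = farthest_point_init(points, k)
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: each point goes to its nearest center.
        labels = [min(range(k), key=lambda c: math.dist(p, centers[c])) for p in points]
        # Update step: each center moves to the mean of its members.
        new_centers = []
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:
                new_centers.append(tuple(sum(x) / len(members) for x in zip(*members)))
            else:
                new_centers.append(centers[c])  # keep an empty cluster's center
        if new_centers == centers:
            break
        centers = new_centers
    return labels

def silhouette_score(points, labels):
    """Mean silhouette coefficient over all points (higher is better)."""
    clusters = {}
    for p, l in zip(points, labels):
        clusters.setdefault(l, []).append(p)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two non-empty clusters")
    total = 0.0
    for p, l in zip(points, labels):
        own = clusters[l]
        if len(own) == 1:
            continue  # singleton clusters contribute a silhouette of 0
        # a: mean distance to the other members of the point's own cluster.
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        # b: mean distance to the members of the nearest other cluster.
        b = min(sum(math.dist(p, q) for q in other) / len(other)
                for m, other in clusters.items() if m != l)
        if max(a, b) > 0:
            total += (b - a) / max(a, b)
    return total / len(points)

def best_k(points, candidates):
    """Score each candidate k and return the best one plus all scores."""
    scores = {k: silhouette_score(points, kmeans(points, k)) for k in candidates}
    return max(scores, key=scores.get), scores

# Toy stand-in for reduced embeddings: three well-separated groups.
POINTS = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 0), (10, 1), (11, 0), (11, 1),
          (5, 10), (5, 11), (6, 10), (6, 11)]

k, scores = best_k(POINTS, range(2, 6))
print(k, {c: round(s, 3) for c, s in scores.items()})
```

On this toy data the three groups are far apart relative to their spread, so the score peaks at three clusters. With real embeddings the peak is typically less pronounced, and the candidate range becomes a tuning decision.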
Keywords