Jisuanji kexue (Jan 2023)

Study on Short Text Classification with Imperfect Labels

  • LIANG Haowei, WANG Shi, CAO Cungen

DOI
https://doi.org/10.11896/jsjkx.211100278
Journal volume & issue
Vol. 50, no. 1
pp. 185 – 193

Abstract

Short text classification techniques have been widely studied. When these techniques are applied to domain-specific short texts in production, two kinds of problems commonly emerge as textual data accumulates: an imperfect label set and a mislabeled training dataset. First, the class label set is generally dynamic in nature. Second, when domain annotators label textual data, it is hard for them to distinguish some fine-grained class labels from others. To address these problems, this paper analyzes in depth the shortcomings of a real, complex telecom-domain label set with numerous classes and proposes a conceptual model for the imperfect multi-class label system. Based on this conceptual model, we introduce a semi-automatic method that, with the help of a seed dataset, iteratively detects conflicts and omissions in a labeled dataset so that they can be repaired. After about six months of iteration, in which the conflicts and omissions caused by the dynamic label set and by annotator mistakes were repaired, the F1-score of the BERT-based classification model is above 0.9 after filtering out the 10% of tickets with low classification confidence.
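
The final evaluation step, scoring the classifier only on tickets that remain after the least confident 10% are set aside, can be illustrated with a minimal sketch. This is not the authors' implementation. The record layout (`gold`, `pred`, `confidence`), the function names, and the choice of macro-averaged F1 are all assumptions made for illustration; the abstract does not say which F1 average was reported or how confidence was computed.

```python
from collections import defaultdict


def filter_low_confidence(records, drop_fraction=0.10):
    """Return the records left after dropping the least confident fraction.

    Each record is a dict with hypothetical keys "gold", "pred" and
    "confidence" (e.g. the classifier's top softmax probability).
    """
    ranked = sorted(records, key=lambda r: r["confidence"], reverse=True)
    n_drop = int(len(ranked) * drop_fraction)
    return ranked[: len(ranked) - n_drop]


def macro_f1(records):
    """Macro-averaged F1 over every label seen as gold or predicted.

    Assumption: macro averaging; the abstract does not specify the variant.
    """
    tp = defaultdict(int)
    fp = defaultdict(int)
    fn = defaultdict(int)
    for r in records:
        if r["pred"] == r["gold"]:
            tp[r["gold"]] += 1
        else:
            fp[r["pred"]] += 1
            fn[r["gold"]] += 1
    labels = set(tp) | set(fp) | set(fn)
    scores = []
    for label in labels:
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores.append(2 * tp[label] / denom if denom else 0.0)
    return sum(scores) / len(scores) if scores else 0.0
```

In practice the filtered-out tickets would typically be routed to human review rather than discarded, so the drop fraction trades automation coverage against accuracy on the tickets that are classified automatically.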

Keywords