Study on Short Text Classification with Imperfect Labels

LIANG Haowei, WANG Shi, CAO Cungen

doi:10.11896/jsjkx.211100278

Jisuanji kexue (Jan 2023)

Affiliations

LIANG Haowei, WANG Shi, CAO Cungen: Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190, China

DOI: https://doi.org/10.11896/jsjkx.211100278
Journal volume & issue: Vol. 50, no. 1
pp. 185 – 193

Abstract

Short text classification techniques have been widely studied. When these techniques are applied to domain-specific short text in production, people often encounter problems in two main areas as textual data accumulates: an imperfect label set and a mislabeled training dataset. First, the class label set is generally dynamic in nature. Second, when domain annotators label textual data, it is hard to distinguish some fine-grained class labels from others. To address these problems, this paper analyzes in depth the shortcomings of a real, complex telecom-domain label set with numerous classes and proposes a conceptual model for the imperfect multi-classification label system. Based on this conceptual model, we introduce a semi-automatic method that, with the help of a seed dataset, iteratively detects the conflicts and omissions in a labeled dataset so that they can be repaired. After about six months of iteration, in which the conflicts and omissions caused by the dynamic label set and by annotator mistakes were repaired, the F1-score of the BERT-based classification model is above 0.9 once the 10% of tickets with low classification confidence are filtered out.
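
The evaluation step mentioned at the end of the abstract (discard the lowest-confidence share of tickets, then compute F1 on the rest) can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the toy tickets, the label names, and the choice of macro-averaged F1 are all assumptions made for illustration, since the abstract does not specify them.

```python
from collections import defaultdict


def filter_low_confidence(records, drop_fraction=0.10):
    """Drop the given fraction of records with the lowest confidence.

    Each record is a (true_label, predicted_label, confidence) tuple.
    The 10% default mirrors the figure quoted in the abstract.
    """
    ranked = sorted(records, key=lambda r: r[2], reverse=True)
    n_drop = int(len(ranked) * drop_fraction)
    return ranked[:len(ranked) - n_drop]


def macro_f1(records):
    """Macro-averaged F1 over all labels seen (averaging mode is an assumption)."""
    tp, fp, fn = defaultdict(int), defaultdict(int), defaultdict(int)
    for true_label, predicted_label, _ in records:
        if true_label == predicted_label:
            tp[true_label] += 1
        else:
            fp[predicted_label] += 1
            fn[true_label] += 1
    labels = set(tp) | set(fp) | set(fn)
    scores = []
    for label in labels:
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores.append(2 * tp[label] / denom if denom else 0.0)
    return sum(scores) / len(scores) if scores else 0.0


# Hypothetical toy tickets: (true label, predicted label, model confidence).
tickets = (
    [("billing", "billing", 0.95)] * 5
    + [("network", "network", 0.90)] * 4
    + [("network", "billing", 0.35)]  # the single low-confidence mistake
)

kept = filter_low_confidence(tickets)
f1_before = macro_f1(tickets)
f1_after = macro_f1(kept)
```

In this toy data the only misclassified ticket is also the least confident one, so filtering removes it and the score on the remaining tickets rises. Real data will not separate this cleanly, and the filtered tickets would still need handling by other means.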

Keywords: imperfect multi-classification label system; fine-grained short text classification; class labeling; data cleaning

Published in Jisuanji kexue

ISSN: 1002-137X (Print)
Publisher: Editorial office of Computer Science
Country of publisher: China
LCC subjects: Science: Mathematics: Instruments and machines: Electronic computers. Computer science: Computer software; Technology: Technology (General)
Website: http://www.jsjkx.com/CN/1002-137X/home.shtml