大数据 (Jul 2023)
Research on iterative data cleaning of human-computer interaction
Abstract
Advances in data collection technology have led to a rapid increase in dataset size. The large scale and high complexity of such data give rise to serious data quality issues, which makes data cleaning a necessary and important step in data-related activities. To reduce human annotation costs while preserving cleaning accuracy, an iterative data cleaning method with human participation (IDCHI) is proposed. In its detection module, the method introduces a data selection optimization strategy that enables the classifier to achieve high accuracy in the initial stage; it further introduces a strategy for selecting which data to annotate manually, effectively reducing the amount of manual annotation required. Experimental results show that the proposed method is effective and efficient in cleaning erroneous data.
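The abstract does not specify IDCHI's classifier or its selection rules, so the following is only a generic illustration of the idea of iterative, human-in-the-loop error detection, not the authors' method. It assumes a toy one-dimensional detector (a threshold above which values are treated as erroneous) and uses plain uncertainty sampling as a stand-in selection rule: in each round, the items closest to the current threshold are sent to the annotator. All names (`fit_threshold`, `iterative_clean`, `oracle`) and the sample data are hypothetical.

```python
from typing import Callable, Dict, List, Tuple


def fit_threshold(labeled: Dict[int, bool], values: List[float]) -> float:
    """Toy detector: midpoint between the largest value labeled clean
    and the smallest value labeled erroneous (assumes errors are large values)."""
    clean = [values[i] for i, is_error in labeled.items() if not is_error]
    dirty = [values[i] for i, is_error in labeled.items() if is_error]
    if not clean or not dirty:
        raise ValueError("need at least one clean and one erroneous label")
    return (max(clean) + min(dirty)) / 2.0


def iterative_clean(
    values: List[float],
    oracle: Callable[[int], bool],  # stands in for the human annotator
    seed: List[int],                # indices labeled before the first round
    batch_size: int = 2,
    rounds: int = 2,
) -> Tuple[List[int], int]:
    """Return (indices flagged as erroneous, number of human labels used)."""
    labeled = {i: oracle(i) for i in seed}
    for _ in range(rounds):
        threshold = fit_threshold(labeled, values)
        unlabeled = [i for i in range(len(values)) if i not in labeled]
        if not unlabeled:
            break
        # Uncertainty sampling: ask about the items nearest the decision boundary.
        unlabeled.sort(key=lambda i: abs(values[i] - threshold))
        for i in unlabeled[:batch_size]:
            labeled[i] = oracle(i)
    threshold = fit_threshold(labeled, values)
    flagged = [i for i, v in enumerate(values) if v > threshold]
    return flagged, len(labeled)


# Hypothetical data: in this simulation, values above 3.0 are the true errors.
values = [1.0, 1.2, 0.9, 1.1, 5.0, 1.3, 4.8, 1.05, 2.9, 3.2]
flagged, n_labeled = iterative_clean(values, lambda i: values[i] > 3.0, seed=[0, 4])
print(flagged, n_labeled)
```

In this toy run the annotator is consulted for only a subset of the ten items, because each round concentrates the queries on the values the current detector is least sure about. A real system would replace the threshold with a trained classifier and the lambda with actual human feedback.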