Data & Policy (Jan 2020)
Semi-supervised machine learning with word embedding for classification in price statistics
Abstract
The Office for National Statistics (ONS) is currently undertaking a substantial research program into using price information scraped from online retailers in the Consumer Prices Index including owner occupiers’ housing costs (CPIH). To make full use of these data, we must classify them into the product types that make up the basket of goods and services used in the current collection. Labeled training data are often limited, and manually labeling more is either impossible or impractical; this is the case with web-scraped price data. We use a semi-supervised machine learning (ML) method, Label Propagation, to develop a pipeline that increases the number of labels available for classification. We apply several techniques in succession and in parallel to give higher confidence in the final, enlarged labeled dataset, which is then used to train a traditional ML classifier. On a test sample of data, the method gives promising results, achieving good precision and recall for both the propagated labels and the classifiers trained on them. By combining several techniques and averaging their results, we increase the usability of a dataset with limited labeled training data, a common problem when using ML in real-world situations. In future work, we will investigate how this method can be scaled up for use in future CPIH calculations and the challenges this brings.
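As an illustration of the label propagation idea named above, the following is a minimal sketch and not the ONS pipeline. It assumes that each product description has already been turned into a numeric vector (here, made-up 2-D points standing in for word embeddings), that unlabeled items are marked with -1, and that similarity is measured with an RBF kernel. The function name `propagate_labels`, the `gamma` value, and the 0.9 confidence cut-off are hypothetical choices for the example.

```python
import math


def propagate_labels(points, labels, gamma=1.0, max_iter=200, tol=1e-6):
    """Spread known labels to unlabeled points over a similarity graph.

    points: list of numeric vectors (stand-ins for word embeddings).
    labels: class index per point, or -1 where the label is unknown.
    Returns a list of (predicted_label, confidence) pairs.
    """
    classes = sorted({lab for lab in labels if lab != -1})
    n, k = len(points), len(classes)

    def rbf(a, b):
        # RBF similarity: close points get a weight near 1, distant ones near 0.
        dist_sq = sum((x - y) ** 2 for x, y in zip(a, b))
        return math.exp(-gamma * dist_sq)

    # Row-normalised similarity matrix (each row sums to 1).
    weights = [[rbf(p, q) for q in points] for p in points]
    trans = [[w / sum(row) for w in row] for row in weights]

    def one_hot(i):
        return [1.0 if labels[i] == c else 0.0 for c in classes]

    # Labeled points start (and stay) one-hot; unlabeled points start uniform.
    dist = [one_hot(i) if labels[i] != -1 else [1.0 / k] * k for i in range(n)]

    for _ in range(max_iter):
        new = []
        for i in range(n):
            if labels[i] != -1:
                new.append(one_hot(i))  # clamp the known labels
            else:
                new.append([
                    sum(trans[i][j] * dist[j][c] for j in range(n))
                    for c in range(k)
                ])
        change = max(
            abs(a - b) for row_a, row_b in zip(new, dist)
            for a, b in zip(row_a, row_b)
        )
        dist = new
        if change < tol:
            break

    result = []
    for row in dist:
        best = max(range(k), key=lambda c: row[c])
        result.append((classes[best], row[best] / sum(row)))
    return result


# Toy data (assumed): two well-separated groups, one labeled item in each.
points = [(0.0, 0.0), (0.2, 0.1), (0.1, 0.3),
          (5.0, 5.0), (5.1, 4.8), (4.9, 5.2)]
labels = [0, -1, -1, 1, -1, -1]

result = propagate_labels(points, labels)

# Keep only propagated labels above a hypothetical confidence cut-off.
confident = [(i, lab) for i, (lab, conf) in enumerate(result) if conf >= 0.9]
```

In a setting like the one described, the confidently propagated labels would then be added to the training set for a conventional classifier; low-confidence items could be left unlabeled rather than risk adding noise.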
Keywords