Sensors (Aug 2024)

Study on Data Preprocessing for Machine Learning Based on Semiconductor Manufacturing Processes

  • Ha-Je Park,
  • Yun-Su Koo,
  • Hee-Yeong Yang,
  • Young-Shin Han,
  • Choon-Sung Nam

DOI
https://doi.org/10.3390/s24175461
Journal volume & issue
Vol. 24, no. 17
p. 5461

Abstract

Read online

Various data types generated in the semiconductor manufacturing process can be used to increase product yield and reduce manufacturing costs. On the other hand, the data generated during the process are collected from various sensors, resulting in diverse units and an imbalanced dataset with a bias towards the majority class. This study evaluated analysis and preprocessing methods for predicting good and defective products using machine learning to increase yield and reduce costs in semiconductor manufacturing processes. The SECOM dataset is used to achieve this, and preprocessing steps are performed, such as missing value handling, dimensionality reduction, resampling to address class imbalances, and scaling. Finally, six machine learning models were evaluated and compared using the geometric mean (GM) and other metrics to assess the combinations of preprocessing methods on imbalanced data. Unlike previous studies, this research proposes methods to reduce the number of features used in machine learning to shorten the training and prediction times. Furthermore, this study prevents data leakage during preprocessing by separating the training and test datasets before analysis and preprocessing. The results showed that applying oversampling methods, excluding KM SMOTE, achieves a more balanced class classification. The combination of SVM, ADASYN, and MaxAbs scaling showed the best performance with an accuracy and GM of 85.14% and 72.95%, respectively, outperforming all other combinations.

Keywords