IEEE Access (Jan 2020)

Using Partial Combination Models to Improve Prediction Quality and Transparency in Mixed Datasets

  • Yi-Hsin Wu,
  • Yu-Hsin Chang,
  • Yin-Jing Tien,
  • Cheng-Juei Yu,
  • Sheng-De Wang,
  • Cheng-Hung Wu

DOI
https://doi.org/10.1109/ACCESS.2020.3008475
Journal volume & issue
Vol. 8
pp. 132106 – 132120

Abstract

Mixed datasets with complex interactions between categorical and numerical attributes are common in engineering and business applications. For example, production rates in manufacturing systems are jointly influenced by categorical attributes, such as machine and product types, and by the numerical attributes associated with them. This study aims to improve the prediction performance and transparency of machine learning (ML) models on mixed datasets with complex interactions. The proposed method requires less data and computational effort than existing hierarchical or clustering regression methods. Multiple prediction models are generated by partitioning a dataset into subsets according to different combinations of categorical attributes. One- and two-stage model selection methods are proposed that use the training and validation datasets to select the better models among all the prediction models. Numerical results on a mixed dataset from a semiconductor manufacturer demonstrate the potential of the model selection approach. Compared with regression models, the proposed approach reduces root mean square error by more than 30%. Cross-validation tests also show a 10% improvement in accuracy over properly tuned XGBoost models. Moreover, the proposed model selection approach is compatible with other regression or ML prediction methods and can be used to improve the transparency of existing methods on mixed datasets.
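
The partition-and-select idea summarized in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes a single numerical attribute `x`, a per-group least-squares line as the prediction model, and a simple exhaustive search that keeps the subset of categorical attributes with the lowest validation RMSE. All function names and the toy data (`machine`, `product`) are hypothetical.

```python
from collections import defaultdict
from itertools import combinations
from math import sqrt


def fit_line(points):
    """Least-squares fit of y = a + b*x; falls back to the mean if x is constant."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    if sxx == 0:
        return mean_y, 0.0
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    b = sxy / sxx
    return mean_y - b * mean_x, b


def fit_partitioned(rows, keys):
    """Fit one line per observed combination of the categorical attributes in `keys`."""
    groups = defaultdict(list)
    for r in rows:
        groups[tuple(r[k] for k in keys)].append((r["x"], r["y"]))
    models = {g: fit_line(pts) for g, pts in groups.items()}
    # Pooled model, used for combinations that never appear in training.
    fallback = fit_line([(r["x"], r["y"]) for r in rows])
    return models, fallback


def predict(model, keys, row):
    models, fallback = model
    a, b = models.get(tuple(row[k] for k in keys), fallback)
    return a + b * row["x"]


def rmse(model, keys, rows):
    return sqrt(sum((predict(model, keys, r) - r["y"]) ** 2 for r in rows) / len(rows))


def select_partition(train, valid, categorical):
    """Try every subset of categorical attributes; keep the lowest validation RMSE."""
    best = None
    for size in range(len(categorical) + 1):
        for keys in combinations(categorical, size):
            model = fit_partitioned(train, keys)
            score = rmse(model, keys, valid)
            if best is None or score < best[0]:
                best = (score, keys, model)
    return best


def make_rows(xs):
    """Toy data (assumed): the slope depends on machine only; product is irrelevant."""
    rows = []
    for machine, slope, offset in (("A", 2.0, 0.0), ("B", 5.0, 1.0)):
        for i, x in enumerate(xs):
            rows.append({"machine": machine, "product": "PQ"[i % 2],
                         "x": float(x), "y": slope * x + offset})
    return rows


train = make_rows([1, 2, 3, 4, 5, 6])
valid = make_rows([7, 8, 9, 10])
score, keys, _ = select_partition(train, valid, ["machine", "product"])
print("selected attributes:", keys, "validation RMSE:", score)
```

On this toy data, any selected partition should include `machine`, because pooling the two machines forces one line through two different slopes. The selected attribute subset also shows directly which categorical attributes matter, which is the sense of transparency the sketch is meant to convey.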

Keywords