Electronic Research Archive (Jul 2022)

An innovative approach of determining the sample data size for machine learning models: a case study on health and safety management for infrastructure workers

  • Haoqing Wang,
  • Wen Yi,
  • Yannick Liu

DOI
https://doi.org/10.3934/era.2022176
Journal volume & issue
Vol. 30, no. 9
pp. 3452 – 3462

Abstract

Numerical experiments are an essential part of academic studies in the field of transportation management. Using an appropriate sample size to conduct experiments can save both data collection cost and computing time. However, few studies have paid attention to how the sample size should be determined. In this research, we use four typical machine learning regression models and a dataset on transport infrastructure workers to explore the appropriate sample size. By observing 12 learning curves, we conclude that a sample size of 250 balances model performance against the cost of data collection. Our study can serve as a reference when deciding in advance how many samples to collect.
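
The learning-curve idea mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' code, models, or data: it assumes a synthetic one-feature dataset and a plain least-squares line, and the names `fit_line`, `rmse`, `learning_curve`, and `make_data` are hypothetical helpers introduced only for this illustration. The sketch fits the model on training subsets of increasing size and records the mean error on a fixed held-out set, so that one can look for the size beyond which the error stops improving noticeably.

```python
import random
import statistics


def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x; returns (a, b)."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    return my - b * mx, b


def rmse(model, xs, ys):
    """Root mean squared error of a fitted (a, b) line on the given points."""
    a, b = model
    return (sum((a + b * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)) ** 0.5


def learning_curve(train, test, sizes, repeats=30, seed=0):
    """Mean held-out RMSE for each training-set size, averaged over random subsets."""
    rng = random.Random(seed)
    test_x, test_y = zip(*test)
    curve = []
    for n in sizes:
        errors = []
        for _ in range(repeats):
            xs, ys = zip(*rng.sample(train, n))  # random subset of size n
            errors.append(rmse(fit_line(xs, ys), test_x, test_y))
        curve.append((n, statistics.fmean(errors)))
    return curve


def make_data(n, rng):
    """Synthetic data (an assumption for illustration): y = 2 + 0.5*x + Gaussian noise."""
    data = []
    for _ in range(n):
        x = rng.uniform(0, 10)
        data.append((x, 2.0 + 0.5 * x + rng.gauss(0, 1.0)))
    return data


rng = random.Random(42)
pool = make_data(1000, rng)      # candidate training samples
held_out = make_data(500, rng)   # fixed evaluation set
curve = learning_curve(pool, held_out, sizes=[10, 25, 50, 100, 250, 500])
for n, err in curve:
    print(f"n={n:4d}  mean held-out RMSE={err:.3f}")
```

On real data, the same loop would be repeated for each model and error metric of interest, and the sample size would be chosen where the curves flatten relative to the cost of collecting more data.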

Keywords