AIMS Mathematics (Jan 2024)

Group feature screening for ultrahigh-dimensional data missing at random

  • Hanji He,
  • Meini Li,
  • Guangming Deng

DOI
https://doi.org/10.3934/math.2024197
Journal volume & issue
Vol. 9, no. 2
pp. 4032 – 4056

Abstract

Missing data are common in data analysis and remain widespread in big data. Previous work has discussed the practicability of two-stage feature screening with categorical covariates missing at random (IMCSIS). Building on that work, we propose group feature screening for ultrahigh-dimensional data with categorical covariates missing at random (GIMCSIS), which can effectively select important features. The proposed method extends the scope of IMCSIS and further improves classification performance when covariates are missing. We construct a two-stage group feature screening procedure based on adjusted Pearson chi-square statistics and prove that it satisfies the sure screening property. In numerical simulations, GIMCSIS achieves better finite-sample performance with binary and multivariate response variables and multi-class covariates. An empirical analysis based on multiple classification results shows that GIMCSIS outperforms IMCSIS in imbalanced data classification.

Keywords