IEEE Access (Jan 2021)

Sample Contribution Pattern Based Big Data Mining Optimization Algorithms

  • Xiaodong Shi,
  • Yang Liu

DOI
https://doi.org/10.1109/ACCESS.2021.3060785
Journal volume & issue
Vol. 9
pp. 32734 – 32746

Abstract


In many big data mining scenarios with large-scale sample sets, heavy computation cost hinders the application of machine learning: training must iterate over the entire dataset repeatedly, without considering the role each sample plays in the computation. We argue, however, that most of the samples that dominate computation resources contribute little to the gradient-based model update, particularly as the model approaches convergence. We define this observation as the Sample Contribution Pattern (SCP) in machine learning. This paper proposes two approaches that exploit SCP by detecting gradient characteristics and triggering the reuse of outdated gradients. In particular, this paper reports research results in (1) the definition and description of SCP, revealing an intrinsic pattern in the gradient contributions of different samples; (2) a novel SCP-based optimizing algorithm (SCPOA) that outperforms the alternative tested algorithms in terms of computation overhead; (3) a variant of SCPOA that incorporates a discarding-recovering mechanism to carefully trade off model accuracy against computation cost; (4) the implementation and evaluation of both algorithms on popular distributed big data mining platforms running typical sample sets; (5) an intuitive convergence proof of the algorithms. Our experimental results illustrate that the proposed approaches significantly reduce computation cost while maintaining competitive accuracy.
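
The abstract's core idea, reusing stale gradients for samples whose recent gradients are small, can be illustrated with a minimal sketch. This is not the authors' SCPOA implementation; the threshold tau, the staleness cap max_stale, and all function names below are illustrative assumptions, and the example uses plain SGD on a least-squares objective rather than the paper's distributed setting.

    # Illustrative sketch (not the paper's SCPOA): SGD on least squares where
    # per-sample gradients with small norm are reused from a cache instead of
    # being recomputed every epoch. `tau` and `max_stale` are hypothetical knobs.
    import numpy as np

    def scp_style_sgd(X, y, lr=0.01, epochs=20, tau=1e-3, max_stale=3):
        n, d = X.shape
        w = np.zeros(d)
        cached_grad = [None] * n            # last computed gradient per sample
        stale_count = np.zeros(n, dtype=int)  # epochs since that gradient was refreshed
        for _ in range(epochs):
            for i in np.random.permutation(n):
                g = cached_grad[i]
                reuse = (
                    g is not None
                    and np.linalg.norm(g) < tau      # low-contribution sample
                    and stale_count[i] < max_stale   # cached gradient not too old
                )
                if reuse:
                    stale_count[i] += 1              # reuse the outdated gradient
                else:
                    residual = X[i] @ w - y[i]       # recompute a fresh per-sample gradient
                    g = residual * X[i]
                    cached_grad[i] = g
                    stale_count[i] = 0
                w -= lr * g
        return w

    if __name__ == "__main__":
        rng = np.random.default_rng(0)
        X = rng.normal(size=(500, 10))
        true_w = rng.normal(size=10)
        y = X @ true_w + 0.01 * rng.normal(size=500)
        w = scp_style_sgd(X, y)
        print("parameter error:", np.linalg.norm(w - true_w))

As the model nears convergence, more samples fall below the gradient-norm threshold and their cached gradients are reused, so the number of fresh gradient evaluations per epoch shrinks; this is the kind of computation saving the paper attributes to exploiting SCP.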

Keywords