IEEE Access (Jan 2018)

Differential Message Importance Measure: A New Approach to the Required Sampling Number in Big Data Structure Characterization

  • Shanyun Liu,
  • Rui She,
  • Pingyi Fan

DOI
https://doi.org/10.1109/ACCESS.2018.2859398
Journal volume & issue
Vol. 6
pp. 42851 – 42867

Abstract

Choosing the sample size is a fundamental problem in statistics, and it plays an important role in data collection for big data scenarios, especially in characterizing data structure. This paper considers the problem from the perspective of message importance by treating the sampling procedure as a process of collecting message importance. To this end, we define the differential message importance measure (DMIM), a measure of message importance for continuous random variables analogous to differential entropy, and calculate the DMIM for some common distributions. Based on the DMIM, this paper proposes a new approach to determining the required sampling number, in which the DMIM deviation is constructed to characterize the process of collecting message importance. The DMIM deviation serves as a new criterion for choosing a sample size large enough that the message importance of the sample set differs from the message importance of the whole by no more than a specified amount. To show that the DMIM deviation can guarantee statistical performance to some extent, we transform the difference in message importance into the Kolmogorov-Smirnov statistic. Theoretical analyses and numerical results also demonstrate that the new approach is distribution-free and consistent with the Glivenko-Cantelli theorem, in agreement with previous results in statistics. Moreover, a connection between message importance and distribution goodness-of-fit is established, which verifies that it is feasible to analyze data collection with message importance taken into account.
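
The abstract ties the required sampling number to the Kolmogorov-Smirnov statistic and the Glivenko-Cantelli theorem. As a rough illustration of that classical background only, and not of the DMIM deviation itself (its definition is not given in this abstract), the sketch below computes the Kolmogorov-Smirnov statistic of a sample against a Uniform(0, 1) distribution and a distribution-free sample size from the Dvoretzky-Kiefer-Wolfowitz (DKW) inequality. The function names `ks_statistic_uniform` and `dkw_sample_size`, the choice of the uniform distribution, and the tolerance values are assumptions made for this example.

```python
import math
import random


def ks_statistic_uniform(samples):
    """Kolmogorov-Smirnov distance between the empirical CDF of `samples`
    and the Uniform(0, 1) CDF, F(x) = x. (Hypothetical helper name.)"""
    xs = sorted(samples)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs):
        # The empirical CDF jumps from i/n to (i+1)/n at x,
        # so the largest gap to F(x) = x occurs on one side of the jump.
        d = max(d, (i + 1) / n - x, x - i / n)
    return d


def dkw_sample_size(eps, alpha):
    """Smallest n with 2 * exp(-2 * n * eps**2) <= alpha.

    By the DKW inequality, such an n keeps the KS distance between the
    empirical and true CDF below eps with probability at least 1 - alpha,
    whatever the underlying continuous distribution is.
    (Classical bound, used here as a stand-in; not the paper's criterion.)
    """
    return math.ceil(math.log(2.0 / alpha) / (2.0 * eps ** 2))


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so runs are repeatable
    for n in (100, 1_000, 10_000):
        d = ks_statistic_uniform([rng.random() for _ in range(n)])
        print(f"n={n:>6}  KS statistic={d:.4f}")
    print("DKW sample size for eps=0.05, alpha=0.05:",
          dkw_sample_size(0.05, 0.05))
```

The KS statistic tends to shrink as n grows, which is the behaviour the Glivenko-Cantelli theorem guarantees in the limit. For a tolerance of 0.05 at confidence level 0.95, the DKW bound gives n = 738 regardless of the distribution, which is the sense of "distribution-free" in this classical setting.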

Keywords