Uncertainty Based Optimal Sample Selection for Big Data

Saadia Ajmal; Rana Aamir Raza Ashfaq; Kashif Saleem

doi:10.1109/ACCESS.2022.3233598

IEEE Access (Jan 2023)

Affiliations

Saadia Ajmal: Department of Computer Science, Bahauddin Zakariya University, Multan, Pakistan
Rana Aamir Raza Ashfaq: Department of Computer Science, Bahauddin Zakariya University, Multan, Pakistan
Kashif Saleem: Department of Computer Sciences and Engineering, College of Applied Studies and Community Service, King Saud University, Riyadh, Saudi Arabia

Journal volume & issue: Vol. 11, pp. 6284–6292

Abstract

In machine learning and pattern recognition, building a better predictive model is a key problem in the presence of big or massive data, especially if that data contains noisy and unrepresentative samples. Such samples adversely affect the learning model and may degrade its performance. To alleviate this problem, it sometimes becomes necessary to sample the data after eliminating unnecessary instances while keeping the underlying distribution intact. This process is called sampling or instance selection (IS). However, it involves a substantial computational cost. This paper discusses an uncertainty-based optimal sample selection (UBOSS) method that can efficiently select a subset of optimal samples. Our proposed work comprises three main steps: first, it uses an IS method to identify the patterns of representative and unrepresentative samples in the original data set; then, an uncertainty-based selector is designed to obtain the fuzziness (i.e., a type of uncertainty) of those samples using a classifier whose output is a membership or fuzzy vector; finally, it uses a divide-and-conquer strategy to obtain a subset of representative samples. Experiments are conducted on six datasets to evaluate the performance of the proposed IS method. Results show that our proposed methodology outperforms the baseline methods (i.e., CNN, IB3, and DROP3) in selection performance (i.e., optimum samples).
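
The abstract does not give the fuzziness formula or the exact splitting rule, so the following is only a minimal sketch of how the fuzziness of a membership vector might be computed and used to group samples. It assumes the De Luca-Termini fuzziness measure and an equal-size split of the ranked samples. The function names `fuzziness` and `split_by_fuzziness` and the parameter `n_groups` are hypothetical and do not come from the paper.

```python
import math


def fuzziness(membership):
    """Fuzziness of one membership vector with entries in [0, 1].

    Assumption: the De Luca-Termini measure, which is 0 for a crisp
    vector such as [1, 0] and largest when every entry equals 0.5.
    """
    total = 0.0
    for mu in membership:
        # Entries equal to 0 or 1 contribute nothing (limit of x*log(x) at 0).
        if 0.0 < mu < 1.0:
            total += mu * math.log(mu) + (1.0 - mu) * math.log(1.0 - mu)
    return -total / len(membership)


def split_by_fuzziness(vectors, n_groups=3):
    """Rank sample indices by fuzziness and cut the ranking into groups.

    Assumption: an equal-size split, used here only to illustrate a
    divide-and-conquer step; the paper's actual rule may differ.
    """
    order = sorted(range(len(vectors)), key=lambda i: fuzziness(vectors[i]))
    size = math.ceil(len(order) / n_groups)
    return [order[k:k + size] for k in range(0, len(order), size)]


# Hypothetical classifier outputs for three samples and two classes.
outputs = [[1.0, 0.0], [0.5, 0.5], [0.9, 0.1]]
groups = split_by_fuzziness(outputs, n_groups=3)
```

In this sketch the first group holds the samples the classifier is most certain about and the last group the most ambiguous ones. Which groups count as representative is a design decision the abstract leaves open.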

Published in IEEE Access

ISSN: 2169-3536 (Online)
Publisher: IEEE
Country of publisher: United States
LCC subjects: Technology: Electrical engineering. Electronics. Nuclear engineering
Website: https://ieeexplore.ieee.org/xpl/RecentIssue.jsp?punumber=6287639
