A New Heuristic for the Data Clustering Problem

Umair F. Siddiqi; Sadiq M. Sait

doi:10.1109/ACCESS.2017.2691412

IEEE Access (Jan 2017)

Affiliations

Umair F. Siddiqi: Center of Communications and IT Research, Research Institute, King Fahd University of Petroleum & Minerals, Dhahran, Saudi Arabia
Sadiq M. Sait: Department of Computer Engineering and the Center of Communications and IT Research, Research Institute, King Fahd University of Petroleum and Minerals, Dhahran, Saudi Arabia

DOI: https://doi.org/10.1109/ACCESS.2017.2691412
Journal volume & issue: Vol. 5
pp. 6801 – 6812

Abstract

This paper presents a new heuristic for the data clustering problem. It comprises two parts. The first part is a greedy algorithm that selects data points that can act as the centroids of well-separated clusters. The second part is a single-solution-based heuristic that performs clustering with the objective of optimizing a cluster validity index. Single-solution-based heuristics are more memory efficient than population-based heuristics. The proposed heuristic is inspired by evolutionary algorithms (EAs) and consists of five main components: 1) genes; 2) fitness of genes; 3) selection; 4) mutation operation; and 5) diversification. The attributes of the cluster centroids are treated as genes. The fitness of a gene is a function of two factors: 1) the difference between its value and the same attribute of the mean of the data points assigned to its cluster and 2) the frequency with which it has been mutated in previous iterations. Genes that have low fitness values should be updated through the mutation operation, which makes a small change (positive or negative) in the value of the gene. Mutants are accepted if they are better (with respect to the objective function) than their parents. However, diversification in the search process is maintained by allowing, with a small probability, mutants to replace their parents even if they are not better. The objective functions used in the proposed heuristic are the Calinski-Harabasz index and the Dunn index. The proposed algorithm has been evaluated on real-life numeric data sets from the UCI repository. The number of data points and the number of attributes in the data sets range from 150 to 11,000 and from 4 to 60, respectively. The results indicate that the proposed algorithm performs better than two standard EAs, 1) the simulated annealing algorithm and 2) the differential evolution algorithm, and a genetic algorithm-based clustering method.
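The single-solution search described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not give the fitness formula, the mutation step size, or the acceptance probability, so the function names (`nearest`, `cluster_means`, `calinski_harabasz`, `search`), the fitness expression `-gap / (1 + count)`, and the defaults `step=0.1` and `p_worse=0.01` are all assumptions made for illustration. The greedy first stage is omitted; the initial centroids are simply passed in. The Calinski-Harabasz index is computed in its usual form, the between-cluster dispersion over `k - 1` divided by the within-cluster dispersion over `n - k`.

```python
import random

def nearest(point, centroids):
    """Index of the centroid closest to point (squared Euclidean distance)."""
    return min(range(len(centroids)),
               key=lambda k: sum((a - b) ** 2 for a, b in zip(point, centroids[k])))

def cluster_means(points, centroids):
    """Group points by nearest centroid; return groups and their means (None if empty)."""
    dim = len(points[0])
    groups = [[] for _ in centroids]
    for p in points:
        groups[nearest(p, centroids)].append(p)
    means = [[sum(p[j] for p in g) / len(g) for j in range(dim)] if g else None
             for g in groups]
    return groups, means

def calinski_harabasz(points, centroids):
    """Calinski-Harabasz index of the induced partition (higher is better).
    Requires at least 2 centroids and more points than centroids."""
    n, k, dim = len(points), len(centroids), len(points[0])
    groups, means = cluster_means(points, centroids)
    if any(m is None for m in means):
        return float("-inf")  # assumption: an empty cluster makes the solution invalid
    overall = [sum(p[j] for p in points) / n for j in range(dim)]
    between = sum(len(g) * sum((m - o) ** 2 for m, o in zip(mean, overall))
                  for g, mean in zip(groups, means))
    within = sum(sum((a - m) ** 2 for a, m in zip(p, mean))
                 for g, mean in zip(groups, means) for p in g)
    if within == 0:
        return float("inf")
    return (between / (k - 1)) * ((n - k) / within)

def search(points, centroids, iters=200, step=0.1, p_worse=0.01, seed=0):
    """Single-solution search: mutate the least-fit gene, keep improving mutants."""
    rng = random.Random(seed)
    current = [list(c) for c in centroids]
    score = calinski_harabasz(points, current)
    best, best_score = [row[:] for row in current], score
    counts = {}  # (cluster, attribute) -> number of times that gene was mutated

    for _ in range(iters):
        _, means = cluster_means(points, current)
        genes = [(c, j) for c, m in enumerate(means) if m is not None
                 for j in range(len(m))]

        def fitness(gene):
            # Assumed form: a large gap to the cluster mean lowers fitness,
            # frequent past mutation raises it back toward zero.
            c, j = gene
            gap = abs(current[c][j] - means[c][j])
            return -gap / (1 + counts.get(gene, 0))

        c, j = min(genes, key=fitness)  # selection: the gene with the lowest fitness
        counts[(c, j)] = counts.get((c, j), 0) + 1

        mutant = [row[:] for row in current]
        mutant[c][j] += rng.choice((-1.0, 1.0)) * step * rng.random()  # small +/- change
        mutant_score = calinski_harabasz(points, mutant)

        # Accept improving mutants; accept others with small probability (diversification).
        if mutant_score > score or rng.random() < p_worse:
            current, score = mutant, mutant_score
            if score > best_score:
                best, best_score = [row[:] for row in current], score
    return best, best_score
```

Calling `search(points, initial_centroids)` returns the best centroid set seen and its index value. Because only one candidate solution and a table of mutation counts are stored, memory use stays small compared with keeping a population, which is the advantage the abstract attributes to single-solution-based heuristics.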

Published in IEEE Access

ISSN: 2169-3536 (Online)
Publisher: IEEE
Country of publisher: United States
LCC subjects: Technology: Electrical engineering. Electronics. Nuclear engineering
Website: https://ieeexplore.ieee.org/xpl/RecentIssue.jsp?punumber=6287639
