Applied Sciences (Feb 2023)
Determining the Quality of a Dataset in Clustering Terms
Abstract
The purpose of the theoretical considerations and research conducted was to identify instruments for verifying whether a dataset is of sufficient quality for segmenting the observations it contains. The paper proposes a novel way to deal with mixed datasets containing categorical and continuous attributes in a customer segmentation task. The categorical variables were embedded using an innovative unsupervised model based on an autoencoder. The customers were then divided into groups using different clustering algorithms operating on similarity matrices. In addition to the classic k-means method and the more modern DBSCAN, three graph algorithms were used: the Louvain algorithm, the greedy algorithm and the label propagation algorithm. The research was conducted on two datasets: one containing retail customers and the other containing wholesale customers. The Calinski–Harabasz index, the Davies–Bouldin index, normalized mutual information (NMI), the Fowlkes–Mallows index and the silhouette score were used to assess the quality of the clustering. It was noted that the modularity parameter of the graph methods was a good indicator of whether a given dataset could be meaningfully divided into groups.
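As an illustration of the evaluation workflow summarized above, the following is a minimal sketch (not taken from the paper) of how the listed clustering-quality indices and a modularity check on a similarity graph might be computed with scikit-learn and NetworkX. The toy data, the RBF-kernel similarity, the choice of k-means and Louvain, and all parameter values are assumptions for demonstration only.

```python
# Illustrative sketch: clustering-quality indices and graph modularity.
# Toy data and parameter choices are assumptions, not the paper's setup.
import numpy as np
import networkx as nx
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import (
    calinski_harabasz_score,
    davies_bouldin_score,
    fowlkes_mallows_score,
    normalized_mutual_info_score,
    silhouette_score,
)
from sklearn.metrics.pairwise import rbf_kernel

# Toy data standing in for embedded customer attributes.
X, true_labels = make_blobs(n_samples=200, centers=3, random_state=0)

# Classic clustering on the feature vectors.
pred_labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# Internal indices (no reference labels needed).
print("Silhouette:        ", silhouette_score(X, pred_labels))
print("Calinski-Harabasz: ", calinski_harabasz_score(X, pred_labels))
print("Davies-Bouldin:    ", davies_bouldin_score(X, pred_labels))

# External indices (compared against reference labels).
print("NMI:               ", normalized_mutual_info_score(true_labels, pred_labels))
print("Fowlkes-Mallows:   ", fowlkes_mallows_score(true_labels, pred_labels))

# Graph clustering on a similarity matrix, with modularity as the indicator
# of whether the dataset can be meaningfully divided into groups.
similarity = rbf_kernel(X)                   # non-negative similarity matrix
np.fill_diagonal(similarity, 0.0)            # drop self-loops
G = nx.from_numpy_array(similarity)          # weighted similarity graph
communities = nx.community.louvain_communities(G, weight="weight", seed=0)
print("Louvain modularity:", nx.community.modularity(G, communities, weight="weight"))
```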
Keywords