kamila: Clustering Mixed-Type Data in R and Hadoop

Alexander H. Foss; Marianthi Markatou

doi:10.18637/jss.v083.i13

Journal of Statistical Software (Feb 2018)

Journal volume & issue: Vol. 83, no. 13
pp. 1 – 44

Abstract

In this paper we discuss the challenge of equitably combining continuous (quantitative) and categorical (qualitative) variables for the purpose of cluster analysis. Existing techniques require either strong parametric assumptions or tuning parameters that are difficult to specify. We describe the kamila package, which includes a weighted k-means approach to clustering mixed-type data, a method for estimating weights for mixed-type data (Modha-Spangler weighting), and an additional semiparametric method recently proposed in the literature (KAMILA). We discuss strategies for estimating the number of clusters in the data and describe the implementation of one such method in the current R package. Background and usage of these clustering methods are presented. We then show how the KAMILA algorithm can be adapted to a map-reduce framework, and implement the resulting algorithm using Hadoop for clustering very large mixed-type data sets.
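
To make the weighted k-means idea mentioned in the abstract concrete, the following is a minimal sketch in Python. It is not the kamila package's implementation or API: the function names (`dummy_code`, `weighted_kmeans_mixed`), the parameter `con_weight`, and the choice to give categorical variables the weight `1 - con_weight` are all assumptions made for illustration. The sketch standardizes the continuous variables, dummy-codes the categorical ones, scales each block by its weight, and runs plain Lloyd's k-means on the combined matrix.

```python
import numpy as np


def dummy_code(labels):
    """One-hot encode a 1-D array of category labels."""
    levels, codes = np.unique(labels, return_inverse=True)
    return np.eye(len(levels))[codes]


def weighted_kmeans_mixed(con, cat, k, con_weight=0.5, n_iter=50, seed=0):
    """Illustrative weighted k-means for mixed-type data (hypothetical helper).

    con        : (n,) or (n, p) array of continuous variables
    cat        : (n,) or (n, q) array of categorical labels
    con_weight : assumed weight in [0, 1] for the continuous block;
                 the categorical block receives 1 - con_weight
    Returns an array of n cluster labels in {0, ..., k-1}.
    """
    con = np.asarray(con, dtype=float)
    if con.ndim == 1:
        con = con[:, None]
    # Standardize each continuous column (assumes no column is constant).
    con = (con - con.mean(axis=0)) / con.std(axis=0)

    cat = np.asarray(cat)
    if cat.ndim == 1:
        cat = cat[:, None]
    dummies = np.hstack([dummy_code(cat[:, j]) for j in range(cat.shape[1])])

    # Combine the two blocks, each scaled by its weight.
    x = np.hstack([con_weight * con, (1.0 - con_weight) * dummies])

    # Lloyd's algorithm with k randomly chosen observations as initial centers.
    rng = np.random.default_rng(seed)
    centers = x[rng.choice(len(x), size=k, replace=False)]
    for _ in range(n_iter):
        dist = ((x[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dist.argmin(axis=1)
        new_centers = np.array([
            x[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels
```

Calling `weighted_kmeans_mixed(con, cat, k=2, con_weight=0.5)` returns one integer label per observation. The difficulty the abstract points to is visible in the sketch: the result can change with `con_weight`, and there is no obvious default, which is the kind of hard-to-specify tuning parameter that weight-estimation and semiparametric approaches aim to avoid.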

Published in Journal of Statistical Software

ISSN: 1548-7660 (Online)
Publisher: Foundation for Open Access Statistics
Country of publisher: United States
LCC subjects: Social Sciences: Statistics
Website: http://www.jstatsoft.org/
