Journal of Statistical Software (Feb 2018)

kamila: Clustering Mixed-Type Data in R and Hadoop

  • Alexander H. Foss,
  • Marianthi Markatou

DOI
https://doi.org/10.18637/jss.v083.i13
Journal volume & issue
Vol. 83, no. 13
pp. 1 – 44

Abstract

In this paper we discuss the challenge of equitably combining continuous (quantitative) and categorical (qualitative) variables for the purpose of cluster analysis. Existing techniques require either strong parametric assumptions or difficult-to-specify tuning parameters. We describe the kamila package, which includes a weighted k-means approach to clustering mixed-type data, a method for estimating weights for mixed-type data (Modha-Spangler weighting), and an additional semiparametric method recently proposed in the literature (KAMILA). We discuss strategies for estimating the number of clusters in the data and describe the implementation of one such method in the current R package. Background and usage of these clustering methods are presented. We then show how the KAMILA algorithm can be adapted to a map-reduce framework, and implement the resulting algorithm using Hadoop for clustering very large mixed-type data sets.

Keywords