FML-kNN: scalable machine learning on Big Data using k-nearest neighbor joins

Georgios Chatzigeorgakidis; Sophia Karagiorgou; Spiros Athanasiou; Spiros Skiadopoulos

doi:10.1186/s40537-018-0115-x

Journal of Big Data (Feb 2018)

Affiliations

Georgios Chatzigeorgakidis: Department of Informatics and Telecommunications, University of Peloponnese
Sophia Karagiorgou: Harokopio University
Spiros Athanasiou: Institute for the Management of Information Systems, ATHENA R.C.
Spiros Skiadopoulos: Department of Informatics and Telecommunications, University of Peloponnese

DOI: https://doi.org/10.1186/s40537-018-0115-x
Journal volume & issue: Vol. 5, no. 1
pp. 1 – 27

Abstract

Efficient management and analysis of large volumes of data is a demanding task of increasing scientific and industrial importance, as the ubiquitous generation of information governs more and more aspects of human life. In this article, we introduce FML-kNN, a novel distributed processing framework for Big Data that performs probabilistic classification and regression, implemented in Apache Flink. The framework’s core consists of a k-nearest neighbor joins algorithm which, contrary to similar approaches, is executed in a single distributed session and is able to operate on very large volumes of data of variable granularity and dimensionality. We assess FML-kNN’s performance and scalability in a detailed experimental evaluation, in which it is compared to similar methods implemented in the Apache Hadoop, Spark, and Flink distributed processing engines. The results indicate an overall superiority of our framework in all the performed comparisons. Further, we apply FML-kNN in two motivating use cases for water demand management, on real-world domestic water consumption data. In particular, we focus on forecasting water consumption using 1-h smart meter data, and on extracting consumer characteristics from water use data in the shower. We also discuss the obtained results, demonstrating the framework’s potential for useful knowledge extraction.
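
To make the abstract's terminology concrete, the following is a minimal single-machine sketch of a k-nearest neighbor join followed by probabilistic classification and regression. It is not the distributed FML-kNN algorithm and does not use Apache Flink; it is a brute-force illustration under the assumption of Euclidean distance, and the function names `knn_join`, `classify_proba`, and `regress` are hypothetical, chosen here for illustration only.

```python
from collections import Counter
from math import dist  # Euclidean distance, available in Python 3.8+


def knn_join(queries, training, k):
    """Brute-force kNN join (illustrative assumption, not the paper's method).

    `training` is a list of (point, target) pairs. For every query point,
    return the k training records closest to it.
    """
    joined = []
    for q in queries:
        nearest = sorted(training, key=lambda rec: dist(q, rec[0]))[:k]
        joined.append(nearest)
    return joined


def classify_proba(neighbors):
    """Probabilistic classification: share of neighbors carrying each label."""
    counts = Counter(label for _, label in neighbors)
    total = len(neighbors)
    return {label: n / total for label, n in counts.items()}


def regress(neighbors):
    """Regression: unweighted mean of the neighbors' numeric targets."""
    return sum(value for _, value in neighbors) / len(neighbors)
```

For a query point, `classify_proba(knn_join([q], labelled, k)[0])` yields a label-to-probability mapping, and `regress(knn_join([q], valued, k)[0])` yields a numeric estimate. The sorting step costs O(n log n) per query, which is exactly the kind of cost that a distributed approach is meant to avoid on large inputs.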

Published in Journal of Big Data

ISSN: 2196-1115 (Online)
Publisher: SpringerOpen
Country of publisher: United Kingdom
LCC subjects: Technology: Electrical engineering. Electronics. Nuclear engineering: Electronics: Computer engineering. Computer hardware; Technology: Technology (General): Industrial engineering. Management engineering: Information technology; Science: Mathematics: Instruments and machines: Electronic computers. Computer science
Website: https://journalofbigdata.springeropen.com
