A Cloud-Based Framework for Large-Scale Log Mining through Apache Spark and Elasticsearch

Yun Li; Yongyao Jiang; Juan Gu; Mingyue Lu; Manzhu Yu; Edward M. Armstrong; Thomas Huang; David Moroni; Lewis J. McGibbney; Greguska Frank; Chaowei Yang

doi:10.3390/app9061114

Applied Sciences (Mar 2019)

Affiliations

Yun Li: NSF Spatiotemporal Innovation Center and Department of Geography and GeoInformation Science, George Mason University, Fairfax, VA 22030, USA
Yongyao Jiang: NSF Spatiotemporal Innovation Center and Department of Geography and GeoInformation Science, George Mason University, Fairfax, VA 22030, USA
Juan Gu: NSF Spatiotemporal Innovation Center and Department of Geography and GeoInformation Science, George Mason University, Fairfax, VA 22030, USA
Mingyue Lu: NSF Spatiotemporal Innovation Center and Department of Geography and GeoInformation Science, George Mason University, Fairfax, VA 22030, USA
Manzhu Yu: NSF Spatiotemporal Innovation Center and Department of Geography and GeoInformation Science, George Mason University, Fairfax, VA 22030, USA
Edward M. Armstrong: NASA Jet Propulsion Laboratory, California Institute of Technology, Pasadena, CA 91109, USA
Thomas Huang: NASA Jet Propulsion Laboratory, California Institute of Technology, Pasadena, CA 91109, USA
David Moroni: NASA Jet Propulsion Laboratory, California Institute of Technology, Pasadena, CA 91109, USA
Lewis J. McGibbney: NASA Jet Propulsion Laboratory, California Institute of Technology, Pasadena, CA 91109, USA
Greguska Frank: NASA Jet Propulsion Laboratory, California Institute of Technology, Pasadena, CA 91109, USA
Chaowei Yang: NSF Spatiotemporal Innovation Center and Department of Geography and GeoInformation Science, George Mason University, Fairfax, VA 22030, USA

DOI: https://doi.org/10.3390/app9061114
Journal volume & issue: Vol. 9, no. 6, p. 1114

Abstract

The volume, variety, and velocity of data, e.g., simulation data, observation data, and social media data, are growing ever faster, posing grand challenges for data discovery. A growing trend in data discovery is to mine hidden relationships among users and metadata from web usage logs to support the data discovery process. Web usage log mining is the process of reconstructing sessions from raw logs and finding interesting patterns or implicit linkages. The mining results play an important role in improving the quality of search-related components, e.g., ranking, query suggestion, and recommendation. Although research has been done in the data discovery domain, collecting and analyzing logs efficiently remains a challenge because (1) the volume of web usage logs continues to grow as long as users access the data; (2) the dynamic volume of logs requires on-demand computing resources for mining tasks; and (3) the mining process is compute-intensive and time-consuming. To speed up the mining process, we propose a cloud-based log-mining framework using Apache Spark and Elasticsearch. In addition, a data partition paradigm, logPartitioner, is designed to solve the data imbalance problem in data parallelism. As a proof of concept, oceanographic data search and access logs are chosen to validate the performance of the proposed parallel log-mining framework.
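
To make the notion of "reconstructing sessions from raw logs" concrete, the following is a minimal single-machine sketch in plain Python. It is not the framework described in the abstract and does not use Spark or Elasticsearch. The record layout (user, timestamp, request), the function name reconstruct_sessions, and the 30-minute inactivity threshold are all assumptions made for illustration; the abstract does not specify them.

```python
from collections import defaultdict
from datetime import datetime, timedelta

# Assumed inactivity threshold; the abstract does not state one.
SESSION_GAP = timedelta(minutes=30)


def reconstruct_sessions(records, gap=SESSION_GAP):
    """Group (user, timestamp, request) records into per-user sessions.

    A new session starts whenever two consecutive requests from the
    same user are separated by more than `gap`.
    """
    # Collect each user's events so they can be ordered in time.
    by_user = defaultdict(list)
    for user, ts, request in records:
        by_user[user].append((ts, request))

    sessions = []
    for user, events in by_user.items():
        events.sort(key=lambda e: e[0])
        current = [events[0]]
        # Walk consecutive pairs and cut the session at large gaps.
        for prev, event in zip(events, events[1:]):
            if event[0] - prev[0] > gap:
                sessions.append((user, current))
                current = []
            current.append(event)
        sessions.append((user, current))
    return sessions


if __name__ == "__main__":
    start = datetime(2019, 3, 1, 10, 0)
    # Hypothetical log records, invented for this example.
    logs = [
        ("a", start, "/search?q=sst"),
        ("b", start + timedelta(minutes=5), "/dataset/1"),
        ("a", start + timedelta(minutes=60), "/dataset/2"),
        ("a", start + timedelta(minutes=10), "/dataset/1"),
    ]
    for user, events in reconstruct_sessions(logs):
        print(user, [request for _, request in events])
```

Because each user's events are processed independently, the per-user grouping is a natural unit of data parallelism. Users with very long histories would then dominate some partitions, which is one way the data imbalance problem mentioned in the abstract can arise.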

Published in Applied Sciences

ISSN: 2076-3417 (Online)
Publisher: MDPI AG
Country of publisher: Switzerland
LCC subjects: Technology: Engineering (General). Civil engineering (General); Science: Biology (General); Science: Physics; Science: Chemistry
Website: http://www.mdpi.com/journal/applsci