EPJ Web of Conferences (Jan 2019)
Developing a monitoring system for Cloud-based distributed data-centers
Abstract
Nowadays more and more datacenters cooperate each others to achieve a common and more complex goal. New advanced functionalities are required to support experts during recovery and managing activities, like anomaly detection and fault pattern recognition. The proposed solution provides an active support to problem solving for datacenter management teams by providing automatically the root-cause of detected anomalies. The project has been developed in Bari using the datacenter ReCaS as testbed. Big Data solutions have been selected to properly handle the complexity and size of the data. Features like open source, big community, horizontal scalability and high availability have been considered and tools belonging to the Hadoop ecosystem have been selected. The collected information is sent to a combination of Apache Flume and Apache Kafka, used as transport layer, in turn delivering data to databases and processing components. Apache Spark has been selected as analysis component. Different kind of databases have been considered in order to satisfy multiple requirements: Hadoop Distributed File System, Neo4j, InfluxDB and Elasticsearch. Grafana and Kibana are used to show data in a dedicated dashboards. The Root-cause analysis engine has been implemented using custom machine learning algorithms. Finally, results are forwarded to experts by email or Slack, using Riemann.