Journal of Edge Computing (May 2024)

Telemetry to solve dynamic analysis of a distributed system

  • Oleh V. Talaver
  • Tetiana A. Vakaliuk

DOI
https://doi.org/10.55056/jec.728
Journal volume & issue
Vol. 3, no. 1

Abstract


In modern software development, distributed solutions have become common because of the flexibility they give large companies. The downside is that when such systems are developed, especially by many teams, global design problems may go unnoticed and slow down development, make errors harder to locate, or degrade overall system performance. In addition, the distributed nature of the architecture complicates timely reaction to system degradation: manually configuring rules for reporting problematic situations is time-consuming and can still be incomplete, whereas automatic detection of possible system anomalies lets engineers (especially Software Reliability Engineers) focus on the actual problems. For this reason, applications that can dynamically analyse the system for problems have great potential. The use of telemetry for system analysis is being actively studied and is gaining traction, so further research is valuable. This work aims to demonstrate, theoretically and practically, that telemetry can be used to analyse a distributed information system and to detect harmful architectural practices and anomalous events. To this end, the paper first gives a detailed overview of the problems related to the topic and of the feasibility of using telemetry. The next section briefly describes the history of monitoring systems and the key points of the latest OpenTelemetry standard, reviews popular application performance monitoring systems, and defines innovative features to be researched further.
The main part covers the approach used to collect and process telemetry, the reasoning behind the use of Neo4j as the data store, a practical overview of graph theory algorithms that help analyse the collected data, and a description of how principal component analysis (PCA) is employed to detect unusual situations across the whole system rather than in individual metrics. The results provide an example of using the presented software with Neo4j Bloom to visualise and analyse data collected over several hours from the OpenTelemetry Demo test system. The last section contains additional remarks on the results of the study.
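
To illustrate the idea of detecting anomalies across the whole system rather than in individual metrics, the following is a minimal sketch of PCA-based detection using reconstruction error. It is not the authors' implementation: the function names (`fit_pca`, `reconstruction_error`), the synthetic three-metric data, the single retained component, and the 99th-percentile threshold are all assumptions made for illustration.

```python
import numpy as np


def fit_pca(X, n_components):
    """Fit PCA on X (rows = time samples, columns = metrics).

    Returns the per-metric mean and standard deviation used for
    standardisation, plus the top principal directions.
    """
    mean = X.mean(axis=0)
    std = X.std(axis=0)
    std[std == 0] = 1.0  # avoid division by zero for constant metrics
    Z = (X - mean) / std
    # Rows of vt are the principal directions of the standardised data.
    _, _, vt = np.linalg.svd(Z, full_matrices=False)
    return mean, std, vt[:n_components]


def reconstruction_error(X, mean, std, components):
    """Squared distance between each sample and its PCA reconstruction."""
    Z = (X - mean) / std
    projected = Z @ components.T @ components
    return np.sum((Z - projected) ** 2, axis=1)


# Hypothetical "normal" telemetry: three metrics that all follow one load signal.
rng = np.random.default_rng(0)
load = rng.normal(size=(500, 1))
normal = np.hstack([1.0 * load, 0.8 * load, 1.2 * load])
normal = normal + 0.05 * rng.normal(size=normal.shape)

mean, std, components = fit_pca(normal, n_components=1)

# Assumed rule: flag anything above the 99th percentile of training error.
threshold = np.quantile(reconstruction_error(normal, mean, std, components), 0.99)

# Each value is individually in range, but the second metric moves against the others.
odd_sample = np.array([[1.0, -0.8, 1.2]])
typical_sample = np.array([[1.0, 0.8, 1.2]])

odd_is_anomaly = reconstruction_error(odd_sample, mean, std, components)[0] > threshold
typical_is_anomaly = reconstruction_error(typical_sample, mean, std, components)[0] > threshold
```

In this sketch the odd sample would pass a per-metric range check, since every value lies within the range seen in training; it is flagged only because it breaks the correlation between metrics that the retained component captures.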

Keywords