Monitoring applications on the ZHORES cluster at Skoltech

Authors: Igor E. Zacharov, Oleg A. Panarin, Sergey G. Rykovanov, Rishat R. Zagidullin, Anton K. Maliutin, Yuri N. Shkandybin, Assel Ye. Yermekova

Journal: Program Systems: Theory and Applications (Программные системы: теория и приложения)

Section: Software and hardware for distributed and supercomputer systems

Issue: No. 2 (49), vol. 12, 2021.

Free access

Standard monitoring tools for cluster computing systems make it possible to assess the performance of the system as a whole, but not to analyze the performance of individual applications. A monitoring system that measures the resources requested by each application separately was developed at Skoltech for the high-performance Zhores cluster. It collects both the usual CPU and GPU utilization metrics and the CPU and GPU event counters, which allow a more detailed analysis of the resources requested by an application. Service programs deployed on each node of the cluster send measurements to a common time series database at one-second intervals. These data are analyzed offline to isolate the characteristics associated with each application's use of computing resources. This should reveal suboptimal applications, allow fine-tuning of the cluster functions, and improve the HPC system overall.
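The per-node collection step described above can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes a Linux `/proc/stat`-style aggregate CPU line as the utilization source and an InfluxDB-style line protocol as the record format for the time series database; the abstract specifies neither. The function names, the `host` tag, and the `util` field are hypothetical.

```python
import time

def parse_cpu_line(line):
    """Parse an aggregate 'cpu ...' line (Linux /proc/stat layout, assumed)
    into a (busy, total) pair of jiffy counters."""
    # Use the first 8 counters only: user, nice, system, idle, iowait,
    # irq, softirq, steal. Guest time is already included in user/nice.
    fields = [int(x) for x in line.split()[1:9]]
    idle = fields[3] + (fields[4] if len(fields) > 4 else 0)  # idle + iowait
    total = sum(fields)
    return total - idle, total

def cpu_utilization(prev, curr):
    """Fraction of non-idle time between two (busy, total) snapshots."""
    d_total = curr[1] - prev[1]
    if d_total <= 0:
        return 0.0
    return (curr[0] - prev[0]) / d_total

def to_line_protocol(measurement, tags, fields, ts_ns):
    """Format one record as 'measurement,tags fields timestamp'
    (InfluxDB-style line protocol; the actual database is an assumption)."""
    tag_str = ",".join(f"{k}={v}" for k, v in sorted(tags.items()))
    field_str = ",".join(f"{k}={v}" for k, v in sorted(fields.items()))
    return f"{measurement},{tag_str} {field_str} {ts_ns}"

def collect(read_cpu_line, emit, host, samples, interval=1.0):
    """Take `samples` measurements, `interval` seconds apart, and pass
    each formatted record to `emit` (e.g. a database client's write call)."""
    prev = parse_cpu_line(read_cpu_line())
    for _ in range(samples):
        time.sleep(interval)
        curr = parse_cpu_line(read_cpu_line())
        util = cpu_utilization(prev, curr)
        emit(to_line_protocol("cpu", {"host": host}, {"util": util}, time.time_ns()))
        prev = curr
```

On a Linux node, `read_cpu_line` could be `lambda: open("/proc/stat").readline()` and `emit` could forward each record to the database. Attributing records to individual applications would additionally require a job identifier, which this sketch leaves out.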


Keywords: cluster, high performance computing, application monitoring, CPU/GPU event counters, time series database.

Short URL: https://sciup.org/143173916

IDR: 143173916   |   DOI: 10.25209/2079-3316-2021-12-2-73-103

Research article