Monitoring applications on the ZHORES cluster at Skoltech
Authors: Igor E. Zacharov, Oleg A. Panarin, Sergey G. Rykovanov, Rishat R. Zagidullin, Anton K. Maliutin, Yuri N. Shkandybin, Assel Ye. Yermekova
Journal: Program Systems: Theory and Applications @programmnye-sistemy
Section: Software and hardware for distributed and supercomputer systems
Article in issue: 2 (49), vol. 12, 2021.
Free access
Standard monitoring tools for cluster computing systems assess the performance of the system as a whole, but they do not allow the performance of individual applications to be analyzed. A monitoring system that measures the resources requested by each application separately was written at Skoltech for the high-performance Zhores cluster. It collects both the usual CPU and GPU utilization metrics and the CPU and GPU event counters, which allow a more detailed analysis of the resources requested by an application. Service programs deployed on each cluster node send measurements to a common time series database at one-second intervals. These data are analyzed offline to isolate the characteristics associated with each application's use of computing resources. This should reveal suboptimal applications, allow fine-tuning of the cluster functions, and improve the HPC system overall.
Keywords: cluster, high-performance computing, application monitoring, CPU/GPU event counters, time series database.
Short URL: https://sciup.org/143173916
IDR: 143173916 | DOI: 10.25209/2079-3316-2021-12-2-73-103
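
The offline, per-application analysis described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the `Sample` record, the `job_id` tag used to attribute a measurement to an application, the metric names, and the 0.5 threshold are all hypothetical, and a small in-memory list stands in for the time series database.

```python
from collections import defaultdict
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class Sample:
    """One per-second measurement (hypothetical record layout)."""
    timestamp: int   # seconds; one sample per second per node
    node: str        # node that reported the measurement
    job_id: str      # assumed tag linking the sample to an application
    metric: str      # e.g. "cpu_util" or "gpu_util" (assumed names)
    value: float     # utilization as a fraction in [0, 1]


def per_job_mean(samples, metric):
    """Average one metric per job over all nodes and timestamps."""
    grouped = defaultdict(list)
    for s in samples:
        if s.metric == metric:
            grouped[s.job_id].append(s.value)
    return {job: mean(values) for job, values in grouped.items()}


def flag_underutilized(means, threshold):
    """Return the jobs whose mean utilization is below the threshold."""
    return sorted(job for job, m in means.items() if m < threshold)


# Synthetic samples standing in for a query against the database.
samples = [
    Sample(0, "node01", "job-a", "cpu_util", 0.90),
    Sample(1, "node01", "job-a", "cpu_util", 0.80),
    Sample(0, "node02", "job-b", "cpu_util", 0.10),
    Sample(1, "node02", "job-b", "cpu_util", 0.30),
    Sample(0, "node02", "job-b", "gpu_util", 0.95),
]

means = per_job_mean(samples, "cpu_util")
low = flag_underutilized(means, threshold=0.5)
print(low)
```

Grouping by an application tag rather than by node is what separates this view from whole-system monitoring: the same node-level samples are reused, only the aggregation key changes.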