Automating debugging and load balancing in fragmented programs

Authors: Vlasenko Andrey, Michurov Mikhail, Mustafin Damir

Journal: Problems of Informatics (Проблемы информатики)

Section: Parallel system programming and computing technologies

Issue: 3 (56), 2022.

Free access

The LuNA (Language for Numerical Algorithms) fragmented programming system is a high-level tool for creating parallel applications. Programs created with LuNA can run on computing systems with shared or distributed memory. Users write their programs in a language resembling standard imperative programming languages (C/C++, Fortran, Pascal, etc.) but do not specify any constructs related to parallelism. Each operator of the LuNA language causes the generation of a new calculation fragment (CF) at runtime, which takes objects called data fragments (DFs) as its inputs and outputs. The totality of CFs makes up a fragmented program at the execution stage. Thus, the algorithm of a fragmented program is represented as a bipartite directed graph with vertices of two types, CF and DF. The paper explains the important properties of fragmented programs: the non-deterministic order in which operators are executed within procedures, the single assignment of DF values, and the absence of arrays in the sense programmers are used to. It describes the most typical logical errors in fragmented programs: an attempt to use uninitialized data, repeated initialization of the same data, and a mismatch between the data type at initialization and at use. When a fragmented program runs on a distributed-memory computing system, the processes share the work of executing calculation fragments. During this work, a computational load imbalance may occur. The paper substantiates the importance of solving this problem. Since LuNA is a high-level parallel programming tool, it is very difficult for the user to solve debugging and load balancing problems. For this reason, the authors develop an automated debugging module based on the "post-mortem" analysis method and an automatic centralized load balancing module for the system.
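The three error classes named above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the `CalcFragment` record, the `find_errors` function, and the representation of DF types as plain strings are all assumptions made for illustration. Because the execution order of CFs is non-deterministic, the sketch checks the whole CF/DF graph in two passes instead of relying on the order of the records.

```python
from dataclasses import dataclass


@dataclass
class CalcFragment:
    """Hypothetical record of one CF: its name and the DFs it reads and writes."""
    name: str
    inputs: dict[str, str]   # DF name -> type the CF expects (assumed encoding)
    outputs: dict[str, str]  # DF name -> type the CF produces (assumed encoding)


def find_errors(cfs: list[CalcFragment]) -> list[tuple[str, str, str]]:
    """Return (error kind, DF name, CF name) for each detected problem."""
    producers: dict[str, tuple[str, str]] = {}  # DF name -> (CF name, type)
    errors: list[tuple[str, str, str]] = []

    # Pass 1: every DF may be assigned only once.
    for cf in cfs:
        for df, df_type in cf.outputs.items():
            if df in producers:
                errors.append(("repeated initialization", df, cf.name))
            else:
                producers[df] = (cf.name, df_type)

    # Pass 2: every consumed DF must have a producer of the matching type.
    for cf in cfs:
        for df, df_type in cf.inputs.items():
            if df not in producers:
                errors.append(("uninitialized data", df, cf.name))
            elif producers[df][1] != df_type:
                errors.append(("type mismatch", df, cf.name))

    return errors
```

For example, a program in which two CFs both produce DF `a`, one CF reads a DF `b` that nothing produces, and another reads `a` as a different type would yield one entry of each kind.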
The automated debugging module collects trace files in JSON format on each process during the execution of a fragmented program. These files record information about the processed CFs, as well as their input and output DFs. After the program terminates, normally or abnormally, the user can call a special software tool (luna_trace) that analyzes the collected traces. It outputs detailed information about the detected errors, including a verbal description of the problem, the program statement with its line number and source file name, the procedure call stack, and the names of the DFs associated with the error. In addition, information specific to a particular error is displayed, for example, for uninitialized DFs, the places where they could have been initialized. Another important part of this module's functionality is the automatic detection of program hangs. The implementation of hang detection is based on a modification of the distributed Dijkstra-Scholten algorithm. A hang occurs when the LuNA program contains one or more errors from the frequently occurring classes. This functionality is therefore particularly important in a supercomputer center, where the user can no longer interact with the program once the task has been queued.
The paper explains the fundamental difference between the dynamic and static load balancing approaches and outlines the cases in which each is applicable. The importance of developing a dynamic load balancing module in the LuNA system is explained. There are two ways to organize dynamic balancing: with centralized and with distributed algorithms. The module being developed implements the first. The centralized load balancing module launches a service load balancer process along with the "worker processes" that execute CFs. The role of this service process is to collect information from the worker processes about executed and ready-to-execute CFs in order to redistribute them and minimize the imbalance. Using this information, the load balancer monitors the relative imbalance between the most and the least loaded processes. When this imbalance reaches a threshold value (set by a parameter), the load balancer generates a "balancing plan" that decides which CFs should be transferred from one process to another. The balancing plan is based on fragment "weights": the weight of a fragment is its execution time, and if a fragment is transferred from one process to another, its weight increases by the transfer time.
The paper presents the results of testing the modules on the computing cluster of Novosibirsk State University (NSU). Tests were performed on the problem of block matrix multiplication. The presented results demonstrate the effectiveness of the centralized load balancing module and the acceptable overhead of the automated debugging module. Plans for further development of the modules' functionality are given at the end.
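The threshold check and the weight-based balancing plan described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the algorithm from the paper: the imbalance formula `(max - min) / max`, the greedy choice of the heaviest movable fragment, the single constant `transfer_time`, and the names `relative_imbalance` and `make_plan` are all hypothetical.

```python
def relative_imbalance(loads: dict[int, float]) -> float:
    """Assumed metric: gap between the most and least loaded process, relative to the maximum."""
    hi, lo = max(loads.values()), min(loads.values())
    return 0.0 if hi == 0 else (hi - lo) / hi


def make_plan(
    ready: dict[int, list[tuple[str, float]]],  # process id -> [(CF name, weight)]
    threshold: float,                            # imbalance level that triggers balancing
    transfer_time: float,                        # assumed constant cost of moving one CF
) -> list[tuple[str, int, int]]:
    """Return a list of (CF name, source process, destination process) transfers."""
    queues = {p: list(cfs) for p, cfs in ready.items()}
    loads = {p: sum(w for _, w in cfs) for p, cfs in queues.items()}
    moved: set[str] = set()  # each CF is transferred at most once
    plan: list[tuple[str, int, int]] = []

    while relative_imbalance(loads) > threshold:
        src = max(loads, key=loads.get)
        dst = min(loads, key=loads.get)
        # A transfer helps only if the destination stays below the current source load,
        # counting the fragment's weight increased by the transfer time.
        movable = [
            (name, w) for name, w in queues[src]
            if name not in moved and loads[dst] + w + transfer_time < loads[src]
        ]
        if not movable:
            break
        name, w = max(movable, key=lambda cf: cf[1])  # greedy: heaviest fragment first
        queues[src].remove((name, w))
        queues[dst].append((name, w + transfer_time))
        loads[src] -= w
        loads[dst] += w + transfer_time
        moved.add(name)
        plan.append((name, src, dst))

    return plan
```

Restricting each fragment to one transfer keeps the sketch from moving the same CF back and forth and guarantees that the loop terminates. A real balancer would presumably estimate transfer time per fragment instead of using a constant.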

Fragmented programming, LuNA system, automated debugging, dynamic load balancing

Short URL: https://sciup.org/143179394

IDR: 143179394 | DOI: 10.24412/2073-0667-2022-3-61-76

Research article