Coordinated checkpointing with sender-based logging and asynchronous recovery from failure
Authors: Bondarenko A.A., Lyakhov P.A., Yakobovskiy M.V.
Published in: Vol. 8, No. 2, 2019.
Free access
The steady growth in the number of components in supercomputers leads HPC specialists to unfavorable estimates for future systems: “the range of the mean time between failures will be from 1 hour to 9 hours.” Such a failure rate makes long-running computations on supercomputers problematic. In this paper, we propose a failure recovery method that does not require all processes to roll back. This method can reduce overhead for some computational algorithms. The standard fault tolerance method consists of two phases: coordinated checkpointing and, in the case of a failure, rollback of all processes to the last checkpoint. The proposed method combines coordinated checkpointing with sender-based logging and asynchronous recovery, in which most processes wait while several processes recalculate the lost data. We developed parallel programs that solve a heat transfer problem in a thin plate, whose computational algorithm has a small amount of data to log. In these programs, failures are injected by calling the function raise(SIGKILL), and coordinated or asynchronous recovery is performed using ULFM functions. To obtain theoretical estimates of overhead, we propose a simulation model of program execution with failures. This model assumes that failures can strike during computation, checkpointing, and recovery. We compared the recovery methods at different failure rates. The comparison showed that asynchronous recovery reduces overhead by 22% to 40% according to the theoretical estimates and by 13% to 53% in the computational experiments.
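The kind of simulation model mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' model: it assumes exponentially distributed failure arrivals, a fixed checkpoint interval, and the standard scheme in which every failure rolls all processes back to the last checkpoint. The function name `simulate` and all of its parameters are hypothetical and chosen only for illustration.

```python
import math
import random


def simulate(work, interval, ckpt_cost, recovery_cost, mtbf, rng):
    """Return the wall-clock time needed to finish `work` units of computation.

    Assumptions (not taken from the paper): failures arrive with exponentially
    distributed gaps of mean `mtbf`; a checkpoint is written after every
    `interval` units of work; any failure discards all progress made since
    the last completed checkpoint (coordinated rollback of all processes).
    """
    def next_failure():
        # Time until the next failure; infinite MTBF means no failures at all.
        return math.inf if math.isinf(mtbf) else rng.expovariate(1.0 / mtbf)

    clock = 0.0   # elapsed wall-clock time
    done = 0.0    # work protected by the last completed checkpoint
    to_failure = next_failure()

    while done < work:
        segment = min(interval, work - done)
        need = segment + ckpt_cost  # compute the segment, then checkpoint it
        if to_failure >= need:
            clock += need
            to_failure -= need
            done += segment
            continue

        # A failure strikes during computation or checkpointing:
        # the time spent so far on this segment is wasted.
        clock += to_failure

        # Recovery can itself be interrupted by failures; retry until it completes.
        while True:
            to_failure = next_failure()
            if to_failure >= recovery_cost:
                clock += recovery_cost
                to_failure -= recovery_cost
                break
            clock += to_failure

    return clock


def overhead(total_time, work):
    """Relative overhead: extra time spent beyond the useful work."""
    return (total_time - work) / work
```

Averaging `overhead(simulate(...), work)` over many seeds for several values of `mtbf` gives a rough picture of how overhead grows with the failure rate. A different recovery scheme could be compared by changing how much work is lost and how long recovery takes, but those costs would have to come from the paper or from measurements.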
MPI, ULFM extension, coordinated checkpointing, asynchronous recovery, fault tolerance
Short URL: https://sciup.org/147233197
IDR: 147233197 | DOI: 10.14529/cmse190205