Mathematical Support for Assessing the Reliability of Software for Computing Clusters

Authors: Aleksandr Nikolaevich Privalov, Aleksey Valerievich Bogomolov, Eugeny Vasilyevich Larkin, Tatyana Alekseevna Akimenko

Journal: Bulletin of the South Ural State University. Series: Mathematics. Mechanics. Physics @vestnik-susu-mmph

Section: Mathematics

Article in issue: No. 4, Vol. 17, 2025.

Free access

The article presents the synthesis of mathematical support for the a priori reliability assessment of computing cluster software, taking into account the specifics of cluster design: parallel information processing, high performance, scalability, increased fault tolerance, load balancing, and support for heterogeneous configurations. The mathematical support is based on linearizing the original sequential algorithm and then transforming it from linear to parallel form. Linearizing the original software structure represents the algorithm it implements as a union of operator sequences, each of which begins and ends in a non-executable operator. The procedure has three stages: representing the algorithm for sequential data processing as a graph; linearizing the graph by forming a graph of implementations through matrix concatenation of its adjacency matrix; and forming a parallel structure from the linearized model by dividing the operator sequence into fragments, whose number equals the number of computing clusters. The paper gives the mathematical dependencies for estimating the probabilities of operator sequences that do not satisfy the initial requirements for the algorithm, based on the distribution laws of the processed data and the branching conditions of the computational process. It considers failure models caused by data processing synchronization errors in computing clusters. The study also searches for, and estimates the probabilities of, branches in the linear and parallel forms that do not meet the requirements of the software's technical specification. The result is a model of a software failure generator that provides an a priori estimate of the mean time between failures.
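
To make the idea of linearization concrete, the following is a minimal sketch, not the authors' matrix-concatenation procedure. It assumes an acyclic control-flow graph whose edges carry branch probabilities, and it enumerates every operator sequence from a start node to a terminal node by depth-first traversal. The graph, the node names, and the functions `linearize` and `failure_probability` are hypothetical and introduced only for illustration.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical control-flow graph: node -> list of (successor, branch probability).
# "start" and the terminal nodes stand in for non-executable operators.
Graph = Dict[str, List[Tuple[str, float]]]


def linearize(graph: Graph, start: str) -> List[Tuple[List[str], float]]:
    """Enumerate all start-to-terminal operator sequences of an acyclic graph.

    Each sequence is returned with its probability, computed as the product
    of the branch probabilities along the path (assumes independent branches).
    """
    sequences: List[Tuple[List[str], float]] = []
    stack: List[Tuple[List[str], float]] = [([start], 1.0)]
    while stack:
        path, prob = stack.pop()
        successors = graph.get(path[-1], [])
        if not successors:
            # A node without successors terminates one linear sequence.
            sequences.append((path, prob))
            continue
        for nxt, p in successors:
            stack.append((path + [nxt], prob * p))
    return sequences


def failure_probability(
    sequences: List[Tuple[List[str], float]], failure_terminals: Set[str]
) -> float:
    """Sum the probabilities of sequences that end in a failure terminal."""
    return sum(prob for path, prob in sequences if path[-1] in failure_terminals)


# Illustrative data only: two branch points, one failure terminal.
EXAMPLE_GRAPH: Graph = {
    "start": [("a", 1.0)],
    "a": [("b", 0.9), ("c", 0.1)],
    "b": [("end_ok", 0.95), ("end_fail", 0.05)],
    "c": [("end_ok", 0.5), ("end_fail", 0.5)],
}

EXAMPLE_SEQUENCES = linearize(EXAMPLE_GRAPH, "start")
EXAMPLE_P_FAIL = failure_probability(EXAMPLE_SEQUENCES, {"end_fail"})
```

Under these assumptions, the probability of entering a branch that violates the requirements is the sum over the failure-terminated sequences. Loops would need to be unrolled or bounded before this traversal applies.
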
The simulated failure flow is treated as a stochastic sum of Poisson flows formed during multiple cyclic launches of the computing cluster. The mean time between failures of the software is calculated from the probability density of the time between failures in the Poisson flow. Mathematical models of the failure generator and of the mean time between failures calculation are synthesized from estimates of the probabilities that the parallel computing process enters the failure branches of the implemented algorithms. Priorities for further work are models of computing cluster software failures caused by structural errors made during the development of parallel programs.
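
The Poisson-flow reasoning can be illustrated with a short sketch. It relies on a standard fact: the superposition of independent Poisson flows with rates λ_i is a Poisson flow with rate λ = Σ λ_i, its inter-failure time has density f(t) = λ·exp(−λt), and the mean of that density is 1/λ. How the rates follow from the branch probabilities is an assumption made here, not something stated in the abstract: each failure branch with per-launch probability p_i is taken to contribute a flow of rate p_i / T, where T is a hypothetical duration of one launch cycle. The function names are likewise illustrative.

```python
import random
from typing import Sequence


def combined_rate(failure_probs: Sequence[float], cycle_time: float) -> float:
    """Rate of the summed failure flow.

    Assumption (not from the source): branch i, entered with probability p_i
    per launch, produces a Poisson flow of rate p_i / cycle_time.
    """
    if cycle_time <= 0:
        raise ValueError("cycle_time must be positive")
    return sum(failure_probs) / cycle_time


def mtbf(rate: float) -> float:
    """Mean of the exponential inter-failure density f(t) = rate * exp(-rate * t)."""
    if rate <= 0:
        raise ValueError("rate must be positive")
    return 1.0 / rate


def simulated_mtbf(rate: float, n_failures: int, seed: int = 0) -> float:
    """Monte Carlo check: average of n exponential inter-failure times."""
    rng = random.Random(seed)
    return sum(rng.expovariate(rate) for _ in range(n_failures)) / n_failures


# Illustrative numbers only: two failure branches, launch cycle of 2 time units.
EXAMPLE_RATE = combined_rate([0.045, 0.05], cycle_time=2.0)
EXAMPLE_MTBF = mtbf(EXAMPLE_RATE)
```

The simulated value from `simulated_mtbf` should approach `mtbf(rate)` as the number of sampled failures grows, which gives a quick sanity check on the analytic estimate.
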


Keywords: graph linearization, algorithm branching, parallel computing, mean time between failures, software reliability, reliability modeling, computing cluster software

Short URL: https://sciup.org/147252292

IDR: 147252292   |   UDC: 004.942   |   DOI: 10.14529/mmph250404