Automated construction of highly-efficient parallel programs for sparse linear algebra in LUNA system

Бесплатный доступ

Automatic construction of efficient scientific parallel programs for supercomputers is in general a complex problem of system parallel programming. Therefore, various specialized system algorithms for automated parallel programs construction are of use. Of interest are algorithms for automated parallel program construction from a given numerical algorithm description. These algorithms can either be general purpose or specialized. The specialized parallel program construction algorithms are of particular interest. Despite the fact that such specialized system algorithms are applicable only for restricted classes of input numerical algorithms, such specialized algorithms have a significant advantage. These algorithms allow parallel program generators to employ parallel programming techniques which are widely used in manual parallel programming to construct highly efficient parallel programs. LuNA system for automatic construction of distributed parallel programs provides a basis for accumulation of such specialized system algorithms to provide high-quality parallel programs construction in particular subject domains. This system allows automated construction of parallel programs for a distributed memory parallel computer from a given numerical algorithm description written with high-level LuNA language. In this paper, a novel specialized algorithm of parallel programs construction for sparse linear algebra domain is presented. It is applicable for a class of sparse linear algebra algorithms which includes widely used preconditioner algorithms for sparse systems of linear equations. This fact makes the developed specialized system algorithm of practical interest. The presented specialized automated parallel program construction algorithm has been implemented as a specialized run-time system which has been integrated into the LuNA system and the module which has been integrated into the LuNA compiler. In order to integrate the implementation of the developed specialized system algorithm into the LuNA system the following approach is used. The LuNA compiler detects if the input numerical algorithm description written with LuNA language belongs to a class of numerical algorithms for which the LuNA system has a specialized support. In this case the corresponding specialized system algorithms are used to construct the so-called intermediate parallel program. Otherwise the previously developed general purpose parallel program construction algorithms are used. The control program constructs an executable representation of the input numerical algorithm at run-time. The executable representation of the input algorithm is a directed acyclic graph of single assignment variables and operations. This representation is then executed by the specialized run-time system. The specialized run-time system employs parallel programming techniques which are widely used in manual development of parallel preconditioners for sparse matrices. In order to assess the performance of parallel programs constructed using the developed specialized system algorithms, a test parallel program was automatically constructed from the description of sparse-vector multiply algorithm. The performance of the automatically constructed parallel program was compared to the performance of implementation of the same algorithm from Intel MKL and Parallel Sparse BLAS libraries. Experimental results demonstrate that the performance of the automatically generated parallel program is comparable with the performance of the corresponding implementation from Intel MKL (the performance of the automatically constructed parallel program is only 15 % less than the performance of the matrix-vector multiply implementation from Intel MKL) . Also the constructed parallel program outperforms the implementation for Parallel Sparse BLAS library by approximately 10 %. These facts show that the LuNA system is practically applicable for automated construction of high performance distributed parallel programs for supercomputers in the sparse linear algebra field.

Еще

Dag, parallel preconditioner, parallel program synthesis, fragmented programming, task based parallelism, parallel run-time system, parallel programming system

Короткий адрес: https://sciup.org/143179392

IDR: 143179392   |   DOI: 10.24412/2073-0667-2022-3-46-60

Статья научная