Автоматическое конструирование высокопроизводительных параллельных программ для задач разреженной линейной алгебры в системе LUNA
Автор: Беляев Н.А.
Журнал: Проблемы информатики @problem-info
Рубрика: Параллельное системное программирование и вычислительные технологии
Статья в выпуске: 3 (56), 2022 года.
Бесплатный доступ
В статье описываются разработанные специализированные системные алгоритмы автоматического конструирования параллельной программы по описанию численного алгоритма для задач разреженной линейной алгебры. Разработанные специализированные системные алгоритмы позволяют применять при автоматическом конструировании параллельных программ техники параллельного программирования, которые широко применяются при ручной разработке параллельных программ в данной предметной области. Разработанные системные алгоритмы автоматического конструирования параллельной программы были реализованы в виде модулей, которые были интегрированы в систему параллельного программирования общего назначения LuNA, разрабатываемую в ИВМиМГ СО РАН. Производительность тестовых параллельных программ, автоматически сконструированных системой LuNA, оказалась сравнима с производительностью популярных широко используемых библиотечных реализаций этих алгоритмов разреженной линейной алгебры.
Граф задач, параллельная реализация прсдобуславливатсля, конструирование параллельной программы, параллелизм уровня задач
Короткий адрес: https://sciup.org/143179392
IDR: 143179392 | УДК: 004.4’242 | DOI: 10.24412/2073-0667-2022-3-46-60
Automated construction of highly-efficient parallel programs for sparse linear algebra in LUNA system
Automatic construction of efficient scientific parallel programs for supercomputers is in general a complex problem of system parallel programming. Therefore, various specialized system algorithms for automated parallel programs construction are of use. Of interest are algorithms for automated parallel program construction from a given numerical algorithm description. These algorithms can either be general purpose or specialized. The specialized parallel program construction algorithms are of particular interest. Despite the fact that such specialized system algorithms are applicable only for restricted classes of input numerical algorithms, such specialized algorithms have a significant advantage. These algorithms allow parallel program generators to employ parallel programming techniques which are widely used in manual parallel programming to construct highly efficient parallel programs. LuNA system for automatic construction of distributed parallel programs provides a basis for accumulation of such specialized system algorithms to provide high-quality parallel programs construction in particular subject domains. This system allows automated construction of parallel programs for a distributed memory parallel computer from a given numerical algorithm description written with high-level LuNA language. In this paper, a novel specialized algorithm of parallel programs construction for sparse linear algebra domain is presented. It is applicable for a class of sparse linear algebra algorithms which includes widely used preconditioner algorithms for sparse systems of linear equations. This fact makes the developed specialized system algorithm of practical interest. The presented specialized automated parallel program construction algorithm has been implemented as a specialized run-time system which has been integrated into the LuNA system and the module which has been integrated into the LuNA compiler. In order to integrate the implementation of the developed specialized system algorithm into the LuNA system the following approach is used. The LuNA compiler detects if the input numerical algorithm description written with LuNA language belongs to a class of numerical algorithms for which the LuNA system has a specialized support. In this case the corresponding specialized system algorithms are used to construct the so-called intermediate parallel program. Otherwise the previously developed general purpose parallel program construction algorithms are used. The control program constructs an executable representation of the input numerical algorithm at run-time. The executable representation of the input algorithm is a directed acyclic graph of single assignment variables and operations. This representation is then executed by the specialized run-time system. The specialized run-time system employs parallel programming techniques which are widely used in manual development of parallel preconditioners for sparse matrices. In order to assess the performance of parallel programs constructed using the developed specialized system algorithms, a test parallel program was automatically constructed from the description of sparse-vector multiply algorithm. The performance of the automatically constructed parallel program was compared to the performance of implementation of the same algorithm from Intel MKL and Parallel Sparse BLAS libraries. Experimental results demonstrate that the performance of the automatically generated parallel program is comparable with the performance of the corresponding implementation from Intel MKL (the performance of the automatically constructed parallel program is only 15 % less than the performance of the matrix-vector multiply implementation from Intel MKL) . Also the constructed parallel program outperforms the implementation for Parallel Sparse BLAS library by approximately 10 %. These facts show that the LuNA system is practically applicable for automated construction of high performance distributed parallel programs for supercomputers in the sparse linear algebra field.
Список литературы Автоматическое конструирование высокопроизводительных параллельных программ для задач разреженной линейной алгебры в системе LUNA
- Schenk O., Gartner K., Fichtner W. E cient sparse LU factorization with left-right looking strategy on shared memory multiprocessors // BIT Numerical Mathematics. 2000. Ò. 40. N 1. P. 158-176.
- Yamazaki I., Li X. S. New scheduling strategies and hybrid programming for a parallel rightlooking sparse LU factorization algorithm on multicore cluster systems // 2012 IEEE 26th International Parallel and Distributed Processing Symposium. IEEE, 2012. P. 619 630.
- Davis T., Stanley K. Sparse lu factorization of circuit simulation matrices // [Electron. Res.]:www.cise.ufl.edu/davis/techreports/KLU/pp04.pdf.
- Geist G. A., Romine C. H. LU factorization algorithms on distributed-memory multiprocessor architectures // SIAM Journal on Scienti-c and Statistical Computing. 1988. Ò. 9. N 4. P. 639-649.
- Choi J. et al. ScaLAPACK: A portable linear algebra library for distributed memory computers-Design issues and performance // Computer Physics Communications. 1996. Ò. 97. N 1-2. P. 1-15.
- Choi J. et al. Design and implementation of the ScaLAPACK LU, QR, and Cholesky factorization routines //Scientic Programming. 1996. Ò. 5. N 3. P. 173-184.
- Gates M. et al. Slate: Design of a modern distributed and accelerated linear algebra library // Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis. 2019. P. 1-18.
- Kurzak J. et al. Designing SLATE: Software for linear algebra targeting exascale. 2017.
- Grigori L., Demmel J. W., Li X. S. Parallel symbolic factorization for sparse LU with static pivoting // SIAM Journal on Scienti c Computing. 2007. Ò. 29. N 3. P. 1289 1314.
- Schenk O. Scalable parallel sparse LU factorization methods on shared memory multiprocessors: dis. ETH Zurich, 2000.
- Schenk O., G artner K., Fichtner W. E cient sparse LU factorization with left-right looking strategy on shared memory multiprocessors // BIT Numerical Mathematics. 2000. Ò. 40. N 1. P. 158-176.
- Toledo S. Locality of reference in LU decomposition with partial pivoting //SIAM Journal on Matrix Analysis and Applications. 1997. Ò. 18. N 4. P. 1065 1081.
- Schenk O., G artner K. Solving unsymmetric sparse systems of linear equations with PARDISO // Future Generation Computer Systems. 2004. Ò. 20. N 3. P. 475-487.
- Schenk O. et al. PARDISO: a high-performance serial and parallel sparse linear solver in semiconductor device simulation // Future Generation Computer Systems. 2001. Ò. 18. N 1. P. 69 78.
- Amestoy P. R. et al. MUMPS: a general purpose distributed memory sparse solver // InternationalWorkshop on Applied Parallel Computing. Springer, Berlin, Heidelberg, 2000. P. 121-130.
- Hysom D., Pothen A. A scalable parallel algorithm for incomplete factor preconditioning // SIAM Journal on Scienti-c Computing. 2001. Ò. 22. N 6. P. 2194-2215.
- Hysom D., Pothen A. E cient parallel computation of ILU (k) preconditioners // Proceedings of the 1999 ACM/IEEE conference on Supercomputing. 1999. P. 29-es.
- Kale L. V., Krishnan S. Charm++ A portable concurrent object oriented system based on C++ // Proceedings of the eighth annual conference on Object-oriented programming systems, languages, and applications. 1993. P. 91-108.
- Bauer M. et al. Legion: Expressing locality and independence with logical regions // SC'12: Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis. IEEE, 2012. P. 1-11.
- Slaughter E. et al. Regent: A high-productivity programming language for HPC with logical regions // Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis. 2015. P. 1-12.
- Malyshkin V. E., Perepelkin V. A. LuNA fragmented programming system, main functions and peculiarities of run-time subsystem // International Conference on Parallel Computing Technologies. Springer, Berlin, Heidelberg, 2011. P. 53-61.
- Belyaev N., Perepelkin V. High-E ciency Specialized Support for Dense Linear Algebra Arithmetic in LuNA System // International Conference on Parallel Computing Technologies. Springer, Cham, 2021. P. 143-150.
- Bosilca G. et al. Parsec: Exploiting heterogeneity to enhance scalability //Computing in Science & Engineering. 2013. Ò. 15. N 6. P. 36-45.
- Bichot C. E., Siarry P. (ed.). Graph partitioning. John Wiley & Sons, 2013.
- Pellegrini F. Scotch and libScotch 5.1 user's guide. 2008.
- Karypis G., Kumar V. METIS: A software package for partitioning unstructured graphs, partitioning meshes, and computing ll-reducing orderings of sparse matrices. 1997.
- Davis T. A., Hu Y. The University of Florida sparse matrix collection //ACM Transactions on Mathematical Software (TOMS). 2011. Ò. 38. N 1. P. 1-25.
- Wang E. et al. Intel math kernel library // High-Performance Computing on the Intel-Xeon PhiTM. Springer, Cham, 2014. P. 167-188.
- [Electron. Res.]: https://www.jscc.ru/
- Filippone S., Colajanni M. PSBLAS: A library for parallel linear algebra computation on sparse matrices // ACM Transactions on Mathematical Software (TOMS). 2000. Ò. 26. N 4. P. 527-550.