Implementing OpenMP4.0 for the NVIDIA PTX architecture in GCC compiler
Автор: Monakov A.V., Ivanishin V.A.
Журнал: Труды Института системного программирования РАН @trudy-isp-ran
Статья в выпуске: 4 т.28, 2016 года.
Бесплатный доступ
The paper describes the approach used in implementing OpenMP offloading to NVIDIA accelerators in GCC. Offloading refers to a new capability in OpenMP 4.0 specification update that allows the programmer to specify regions of code that should be executed on an accelerator device that potentially has its own memory space and has an architecture tuned towards highly parallel execution. NVIDIA provides a specification of the abstract PTX architecture for the purpose of low-level, and yet portable, programming of their GPU accelerators. PTX code usually does not use explicit vector (SIMD) computation; instead, vector parallelism is expressed via SIMT (single instruction - multiple threads) execution, where groups of 32 threads are executed in lockstep fashion, with support on hardware level for divergent branching. However, some control flow constructs such as spinlock acquisition can lead to deadlocks, since reconvergence points after branches are inserted implicitly. Thus, our implementation maps logical OpenMP threads to PTX warps (synchronous groups of 32 threads). Individual PTX execution contexts are therefore mapped to logical OpenMP SIMD lanes (this is similar to the mapping used in OpenACC). To implement execution of one logical OpenMP thread by a group of PTX threads we developed a new code generation model that allows to keep all PTX threads active, have their local state (register contents) mirrored, and have side effects from atomic instructions and system calls such as malloc happen only once per warp. This is achieved by executing the original atomic or call instruction under a predicate, and then propagating the register holding the result using the shuffle exchange (shfl) instruction. Furthermore, it is possible to setup the predicate and the source lane index in the shuffle instruction in a way that this sequence has the same effect as just the original instruction inside of SIMD regions. We also describe our implementation of compiler-defined per-warp stacks, which is required to have per-warp automatic storage outside of SIMD regions that allows cross-warp references (normally automatic storage in PTX is implemented via.local memory space which is visible only in the PTX thread that owns it). This is motivated by our use of unmodified OpenMP lowering in GCC where possible, and thus using libgomp routines for entering parallel regions, distribution of loop iterations, etc. We tested our implementation on a set of micro-benchmarks, and observed that there is a fixed overhead of about 100 microseconds when entering a target region, mostly due to startup procedures in libgomp (and notably due to calls to malloc), but for long-running regions where that overhead is small we achieve performance similar to analogous OpenACC and CUDA code.
Compilers, gcc, openmp, cuda, ptx
Короткий адрес: https://sciup.org/14916369
IDR: 14916369 | DOI: 10.15514/ISPRAS-2016-28(4)-10