Automatic Program Construction with Accelerator Usage Based on the Active Knowledge Concept in the LuNA System

As accelerators are designed for specific types of computations (e.g., GPUs, NPUs, FPGAs), they can significantly improve the performance of corresponding programs. In fact, most of the computing power in modern multicomputers is provided by accelerators. However, using accelerators in programs requires knowledge of system programming, which makes development more difficult for experts in other domains. For example, it is necessary to manage data transfers between CPUs and accelerators, implement load balancing, and schedule computations on accelerators. Achieving good efficiency requires different optimization methods in different subject domains. One way to reduce the labor costs of developing programs that use accelerators is to offload some complex routine tasks from humans to an automation system. Such systems can lower the level of competence required to develop efficient programs by performing automatic program construction. Potentially, these automation systems can also produce programs that are more efficient than average hand-coded implementations. Since no universal automation approach exists for such systems, various approaches must be developed for particular classes of applied problems and types of accelerators.

One method that can be applied here is the active knowledge concept [1]. Automatic construction of parallel programs using the active knowledge concept is performed on the basis of computational models (CMs). CMs are designed to represent knowledge in a certain subject domain. For simplicity, a CM can be viewed as a bipartite directed graph consisting of operations and variables. Each variable represents a certain value within the subject domain. Each operation represents the derivation of its output variables (on outgoing arcs) from its input variables (on incoming arcs). Operations are supplied with code fragments (CFs) in the form of conventional subroutines (procedures).
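As an illustration of this graph view, consider the following minimal sketch. The class names and fields are assumptions made for exposition, not LuNA's actual data structures: each operation derives its output variables from its input variables and carries a code fragment (CF) in the form of an ordinary callable.

```python
# Illustrative sketch (not LuNA's real data model): a computational model
# (CM) represented as a bipartite directed graph of variables and operations.

from dataclasses import dataclass, field
from typing import Callable, List, Set

@dataclass
class Operation:
    name: str
    inputs: List[str]           # incoming arcs: input variables
    outputs: List[str]          # outgoing arcs: output variables
    cf: Callable = None         # code fragment: a conventional subroutine

@dataclass
class ComputationalModel:
    variables: Set[str] = field(default_factory=set)
    operations: List[Operation] = field(default_factory=list)

    def add(self, op: Operation) -> None:
        # Registering an operation implicitly registers its variables.
        self.variables |= set(op.inputs) | set(op.outputs)
        self.operations.append(op)

# A toy subject domain with two operations: b = f(a), c = g(b).
cm = ComputationalModel()
cm.add(Operation("f", ["a"], ["b"], cf=lambda a: a + 1))
cm.add(Operation("g", ["b"], ["c"], cf=lambda b: b * 2))
```

In this toy CM, chaining the two code fragments computes c from a, which is exactly the kind of derivation a constructed program performs.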
To solve a problem in this CM’s subject domain, we define a VW-task on the CM, where V is the set of input variables and W is the set of output variables to be computed. A VW-plan is then derived from the VW-task (if it exists). A VW-plan is a subgraph of the CM containing the operations and variables necessary to compute the variables in W from the variables in V. Using the VW-plan, a program can be constructed that takes values for the variables in V and computes values for the intermediate variables and the variables in W. Such programs can be parallel if different variables can be evaluated independently.

The actual program construction (VW-plan derivation and/or program generation) is performed by an active knowledge system. An example of such a system, explored in this paper, is the LuNA system [2, 3]. The LuNA system takes a prepared VW-plan and performs program generation; the generated program is then executed by the runtime system. The runtime system uses a thread pool to schedule and execute operations. The thread pool consists of one queue and multiple worker threads that take available tasks from the queue and execute them. In a distributed system, multiple instances of the runtime system can execute the generated program. These instances communicate to decide which operation will be executed on which node.

In this paper, we consider an extension of the LuNA runtime system to support accelerators. In particular, we examine the Huawei Ascend neural processor [4]. We propose a high-level specification called CoFaNA (Code Fragment Notation for Accelerators) for describing computations to be executed on the Ascend accelerator. We also propose an extension of the thread pool for performing computations on both CPUs and Ascends. The thread pool was extended to have three queues: one for tasks to be executed on CPUs, another for tasks to be executed on Ascends, and a third for heterogeneous tasks that can be executed on either CPUs or Ascends.
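The VW-plan derivation introduced above can be sketched, for illustration, as a backward reachability walk over the CM: starting from the requested outputs W, collect the operations needed to produce each variable until every remaining requirement lies in V. This is a didactic simplification using plain tuples, not LuNA's actual planning algorithm.

```python
# Hypothetical sketch of VW-plan derivation as backward reachability.
# Operations are (name, inputs, outputs) tuples.

def derive_vw_plan(operations, V, W):
    # Map each variable to the operation that produces it.
    producers = {out: op for op in operations for out in op[2]}
    plan, needed = [], list(W)
    while needed:
        var = needed.pop()
        if var in V:
            continue                  # supplied as an input value
        op = producers.get(var)
        if op is None:
            return None               # no VW-plan exists
        if op not in plan:
            plan.append(op)
            needed.extend(op[1])      # require the operation's own inputs
    return list(reversed(plan))       # execution order: producers first

# Toy CM: b = f(a), c = g(b); VW-task: compute W = {c} from V = {a}.
ops = [("f", ["a"], ["b"]), ("g", ["b"], ["c"])]
plan = derive_vw_plan(ops, {"a"}, {"c"})
```

With V = {a} the plan lists f before g; with V empty no plan exists, since nothing produces a.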
We also split the worker threads into two groups: one group executes tasks on CPUs (including heterogeneous tasks), and the other executes tasks on Ascends.

Next, based on the implementation of support for the Ascend accelerator, we propose a subsystem that can potentially use any accelerator. Support for a particular accelerator is implemented in a separate extension module (plugin), whose subprograms the subsystem uses during program generation and during execution in the runtime system. An extension module implements interfaces for the code generator, the context, and the thread pool, and also provides metainformation. The metainformation is used during program generation to link the necessary shared libraries. The context is used to initialize and deinitialize the libraries used by the extension module. The code generator can parse various high-level specifications (for example, the CoFaNA format for the Ascend accelerator). The thread pool can be implemented in various ways to use accelerators more efficiently.
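To make the extended thread pool concrete, here is a runnable sketch under explicit assumptions: the queue names, worker counts, and polling policy (a worker drains its own device queue first, then falls back to the shared heterogeneous queue) are illustrative choices, not LuNA's actual implementation, and real Ascend workers would dispatch kernels through the accelerator runtime rather than call Python functions.

```python
# Illustrative three-queue thread pool: CPU tasks, Ascend tasks, and
# heterogeneous tasks that either worker group may claim.

import queue
import threading
import time

cpu_q, ascend_q, hetero_q = queue.Queue(), queue.Queue(), queue.Queue()
results, lock = [], threading.Lock()
done = threading.Event()

def worker(primary_q):
    while True:
        for q in (primary_q, hetero_q):   # own queue first, then shared
            try:
                task = q.get_nowait()
            except queue.Empty:
                continue
            with lock:
                results.append(task())    # execute the claimed task
            q.task_done()
            break
        else:
            if done.is_set():             # both queues empty and shutdown
                return
            time.sleep(0.001)

# Two CPU workers and one Ascend worker (counts are arbitrary here).
threads = [threading.Thread(target=worker, args=(q,))
           for q in (cpu_q, cpu_q, ascend_q)]
for t in threads:
    t.start()

cpu_q.put(lambda: "cpu-task")
ascend_q.put(lambda: "ascend-task")
hetero_q.put(lambda: "hetero-task")

for q in (cpu_q, ascend_q, hetero_q):
    q.join()                              # wait until all tasks complete
done.set()
for t in threads:
    t.join()
```

Letting both worker groups claim heterogeneous tasks provides a simple form of dynamic load balancing between the CPU and the accelerator.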

Active knowledge, LuNA system, accelerator, Huawei Ascend processor, automatic program construction, high-level specification, subsystem for accelerator support

Short address: https://sciup.org/143185320

IDR: 143185320   |   UDC: 004.4'242   |   DOI: 10.24412/2073-0667-2025-4-73-88