Imperative code conversion method for parallel data processing platforms

Основное

Автор: Simonov V.S., Khairetdinov M.

Журнал: Проблемы информатики @problem-info

Рубрика: Параллельное системное программирование и вычислительные технологии

Статья в выпуске: 3 (60), 2023 года.

Бесплатный доступ

There are many data processing platforms that allow sequential programs to access parallel processing capabilities. To benefit from the advantages of such platforms, existing code has to be rewritten into domain-specific languages that each platform supports. This transformation, a tedious and error-prone process, also requires developers to choose the right platform that optimizes performance based on a specific workload. This article describes a formal method, the result of which on imperative code is code suitable for execution in a parallel data processing system, for example, Hadoop, implementing the MapReduce paradigm. Given a sequential code fragment, a method is used to output a high-level summary expressed in our the language of the program specification, which is then compiled for execution in Apache Spark [1]. We demonstrate that the method allows you to convert imperative code into suitable for execution on the Apache Spark platform. Translated results are executed 1.3 times faster on average than sequential implementations, and also scale better for large datasets. As computing becomes more ubiquitous, storage becomes cheaper, and data collection tools become more sophisticated, more data is being collected today than ever before. Data-driven advances are becoming increasingly common in various scientific fields. Thus, efficient analysis and processing of huge data sets is a huge computational task. For processing very large data sets, many parallel data processing platforms have been developed [1-5], and new ones continue to be developed [5-7]. Most parallel data processing frameworks come with domain-specific optimizations , which are provided either through the library application programming interface (API) [1-4, 6, 7], or using a high-level domain-specific language: domain-specific language (DSL), so that users can express their calculations [5, 8]. Calculations expressed using such API or DSL calls are more efficient due to the optimization of platforms for a specific domain [8-11]. However, many of the problems associated with this approach often make frameworks related to a specific subject area inaccessible to non-spccialists, such as researchers studying physical or social sciences. First, domain-specific optimization for various workloads requires an expert to determine in advance the most appropriate structure for a given piece of code. Secondly, end users often have to learn new APIs or DSLs [1-3, 6, 7, 12] and transform existing code to take advantage of the advantages provided by some platforms. This requires not only considerable time and resources, but is also fraught with errors in the code. Moreover, even users who want to transform their applications must first understand the purpose of the code that could have been written by others, and manually written low-level optimizations in the code often hide high-level intentions. Finally, even after learning new APIs and rewriting code, newly emerging frameworks often turn newly written code into outdated applications. Users then have to repeat this process in order to keep up with new advances, which requires considerable time, which would be better spent on promoting scientific discoveries. One way to improve the availability of parallel data processing platforms involves creating compilers that automatically convert applications written in common general-purpose languages (such as C, Java or Python) into high-performance parallel processing applications such as Hadoop or Spark. These compilers allow users to write their applications in familiar general-purpose languages and allow the compiler to reassign parts of their code to high-performance DSL [13-15]. Then applications can use the performance of these specialized frameworks without the additional cost of learning to program individual DSLs. But such compilers do not exist for all cases, and their creation can be very difficult.

Еще

Imperative code, parallel data processing

Короткий адрес: https://sciup.org/143181008

IDR: 143181008 | DOI: 10.24412/2073-0667-2023-3-68-80

Статья научная