A Method for Transforming Imperative Code for Parallel Data Processing Platforms
Authors: V. S. Simonov, M. S. Khairetdinov
Journal: Problems of Informatics (Problemy Informatiki)
Section: Parallel system programming and computing technologies
In issue: No. 3 (60), 2023.
Free access
There are many data processing platforms that allow sequential programs to access parallel processing capabilities. To benefit from the advantages of such platforms, existing code has to be rewritten into the domain-specific languages that each platform supports. This transformation, a tedious and error-prone process, also requires developers to choose the platform that optimizes performance for a particular workload. This article describes a formal method which, applied to imperative code, produces equivalent instructions suitable for execution on a parallel data processing system such as Hadoop, which implements the MapReduce paradigm. The method derives a high-level summary, expressed in our program specification language, which is then compiled for execution in Apache Spark [1]. We show that the method can transform imperative code into code executable on the Apache Spark platform. The translated results run on average 3.3 times faster than the sequential implementations and scale better to large datasets.
Keywords: imperative code, parallel data processing
Short address: https://sciup.org/143181008
IDR: 143181008 | UDC: 004.89 | DOI: 10.24412/2073-0667-2023-3-68-80
Imperative code conversion method for parallel data processing platforms
There are many data processing platforms that allow sequential programs to access parallel processing capabilities. To benefit from the advantages of such platforms, existing code has to be rewritten into the domain-specific languages that each platform supports. This transformation, a tedious and error-prone process, also requires developers to choose the platform that optimizes performance for a particular workload. This article describes a formal method which, applied to imperative code, produces code suitable for execution on a parallel data processing system such as Hadoop, which implements the MapReduce paradigm. Given a sequential code fragment, the method derives a high-level summary, expressed in our program specification language, which is then compiled for execution in Apache Spark [1]. We demonstrate that the method can convert imperative code into code executable on the Apache Spark platform. The translated results run on average 3.3 times faster than the sequential implementations and scale better to large datasets.

As computing becomes more ubiquitous, storage becomes cheaper, and data collection tools become more sophisticated, more data is being collected today than ever before. Data-driven advances are becoming increasingly common across scientific fields, so the efficient analysis and processing of huge data sets is a formidable computational task. Many parallel data processing platforms have been developed for very large data sets [1-5], and new ones continue to appear [5-7]. Most parallel data processing frameworks come with domain-specific optimizations, exposed either through a library application programming interface (API) [1-4, 6, 7] or through a high-level domain-specific language (DSL) in which users express their computations [5, 8].
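To make the kind of transformation described above concrete, here is a minimal, hypothetical sketch (not the paper's actual benchmark code): a sequential imperative accumulation loop and its equivalent in map/reduce form, the shape that MapReduce platforms execute in parallel. In Spark the second version would correspond to `sc.parallelize(values).map(lambda v: v * v).reduce(lambda a, b: a + b)`; plain Python's `map` and `functools.reduce` are used here so the sketch is self-contained.

```python
from functools import reduce

# Sequential, imperative original: sum of squares over a list.
def sum_of_squares_seq(values):
    total = 0
    for v in values:
        total += v * v
    return total

# Equivalent map/reduce formulation. Squaring is elementwise and addition
# is associative and commutative, so the computation can be partitioned
# across workers and combined in any order.
def sum_of_squares_mr(values):
    squared = map(lambda v: v * v, values)
    return reduce(lambda a, b: a + b, squared, 0)

data = list(range(10))
assert sum_of_squares_seq(data) == sum_of_squares_mr(data) == 285
```

The translator's job is to recognize that the loop on the left has the algebraic structure of the expression on the right; once that summary is established, compiling it to any MapReduce-style backend is mechanical.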
Computations expressed through such API or DSL calls run more efficiently because the platforms are optimized for a specific domain [8-11]. However, several problems with this approach often make domain-specific frameworks inaccessible to non-specialists, such as researchers in the physical or social sciences. First, exploiting domain-specific optimizations for different workloads requires an expert to determine in advance the most appropriate framework for a given piece of code. Second, end users often have to learn new APIs or DSLs [1-3, 6, 7, 12] and transform existing code to take advantage of what a given platform provides; this demands considerable time and resources and is error-prone. Moreover, users who want to transform their applications must first understand the intent of code that may have been written by others, and hand-written low-level optimizations often obscure the high-level intent. Finally, even after users learn new APIs and rewrite their code, newly emerging frameworks often render the freshly rewritten code obsolete; users must then repeat the whole process to keep up with new advances, spending time that would be better spent on scientific discovery.

One way to make parallel data processing platforms more accessible is to build compilers that automatically convert applications written in common general-purpose languages (such as C, Java, or Python) into high-performance parallel processing applications for platforms such as Hadoop or Spark. Such compilers let users write their applications in familiar general-purpose languages while the compiler retargets parts of their code to high-performance DSLs [13-15]. Applications can then exploit the performance of these specialized frameworks without the extra cost of learning to program each individual DSL.
Such compilers, however, do not exist for every case, and building them can be very difficult.
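Part of what makes such compilers hard to build is the soundness obligation: a loop `acc = op(acc, x)` may be lifted into a parallel reduce only if `op` is associative and the loop's initial value acts as its identity; otherwise splitting the input across workers changes the result. The following sketch checks these properties by random property testing; it is an illustration of the condition itself, not of the formal verification a lifting compiler would actually perform, and all names in it are hypothetical.

```python
import random

# Property-testing sketch of the reducibility check: a combine operator is
# safe to parallelize only if it is associative and `identity` is neutral.
def looks_reducible(op, identity, samples, trials=200, seed=0):
    rng = random.Random(seed)
    for _ in range(trials):
        a, b, c = (rng.choice(samples) for _ in range(3))
        if op(op(a, b), c) != op(a, op(b, c)):            # associativity
            return False
        if op(identity, a) != a or op(a, identity) != a:  # identity element
            return False
    return True

nums = list(range(-50, 51))
assert looks_reducible(lambda a, b: a + b, 0, nums)      # sum: parallelizable
assert not looks_reducible(lambda a, b: a - b, 0, nums)  # subtraction: not
```

A verified-lifting approach replaces the random trials with a proof obligation discharged by a solver, which is what distinguishes it from best-effort pattern matching.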
References
- Apache Spark. [Electron res.]: https://spark.apache.org. Accessed: 2023-01-19.
- Apache Hadoop. [Electron res.]: http://hadoop.apache.org. Accessed: 2023-01-19.
- Apache Storm. [Electron res.]: http://storm.apache.org. Accessed: 2023-01-19.
- GraphLab Create. [Electron res.]: https://dato.com/. Accessed: 2023-01-20.
- MongoDB [Electron res.]: https://www.mongodb.org. Accessed: 2023-01-19.
- Akidau T., Bradshaw R., Chambers C., Chernyak S., Fernandez-Moctezuma R. J., Lax R., McVeety S., Mills D., Perry F., Schmidt E., Whittle S. The Dataflow Model: A Practical Approach to Balancing Correctness, Latency, and Cost in Massive-Scale, Unbounded, Out-of-Order Data Processing // Proceedings of the VLDB Endowment 8, 2015. P. 1792-1803.
- TensorFlow. [Electron res.]: http://tensorflow.org/. Accessed: 2023-01-20.
- Ragan-Kelley J., Barnes C., Adams A., Paris S., Durand F., Amarasinghe S. Halide: A Language and Compiler for Optimizing Parallelism, Locality, and Recomputation in Image Processing Pipelines // Proceedings of the 34th ACM SIGPLAN Conference on Programming Language Design and Implementation, 2013. PLDI ’13, ACM, New York, NY, USA. P. 519-530, DOI: 10.1145/2491956.2462176.
- Apache Hive. [Electron res.]: http://hive.apache.org. Accessed: 2023-01-20.
- Solar-Lezama A., Arnold G., Tancau L., Bodik R., Saraswat V., Seshia S. Sketching Stencils // Proceedings of the 28th ACM SIGPLAN Conference on Programming Language Design and Implementation, 2007. PLDI ’07, ACM, New York, NY, USA. P. 167-178, DOI: 10.1145/1273442.1250754.
- Sujeeth A. K., Brown K. J., Lee H., Rompf T., Chafi H., Odersky M., Olukotun K. Delite: A Compiler Architecture for Performance-Oriented Embedded Domain-Specific Languages. 2014.
- Hoare C. A. R. An Axiomatic Basis for Computer Programming // Communications of the ACM 12(10), 1969. P. 576-580, DOI: 10.1145/363235.363259.
- Cheung A., Solar-Lezama A., Madden S. Optimizing Database-backed Applications with Query Synthesis // Proceedings of the 34th ACM SIGPLAN Conference on Programming Language Design and Implementation, 2013. PLDI ’13, ACM, New York, NY, USA. P. 3-14, DOI: 10.1145/2491956.2462180.
- Kamil S., Cheung A., Itzhaky S., Solar-Lezama A. Verified Lifting of Stencil Computations // SIGPLAN Not. 2016. 51(6), P. 711-726, DOI: 10.1145/2980983.2908117.
- Radoi C., Fink S. J., Rabbah R., Sridharan M. Translating Imperative Code to MapReduce // Proceedings of the 2014 ACM International Conference on Object Oriented Programming Systems Languages and Applications, OOPSLA ’14, 2014. ACM, New York, NY, USA. P. 909-927, DOI: 10.1145/2660193.2660228.
- Ernst M. D., Perkins J.H., Guo P. J., McCamant S., Pacheco C., Tschantz M.S., Xiao C. The Daikon System for Dynamic Detection of Likely Invariants. Sci. Comput. Program. 2007.
- Srivastava S., Gulwani S. Program Verification Using Templates over Predicate Abstraction // Proceedings of the 30th ACM SIGPLAN Conference on Programming Language Design and Implementation, PLDI ’09, 2009. ACM, New York, NY, USA. P. 223-234, DOI: 10.1145/1542476.1542501.