Distributed Computational Experiments in the MLOps Platform of HSE University
Authors: Anton S. Khritankov, Valentin A. Polezhaev, Georgiy A. Zhulikov, Maksim S. Halynchik, Nikita A. Klimin, Kirill E. Sakharov, Viktor O. Minchenkov, Ivan V. Spirin, Ivan I. Krupnov, Sofia F. Yakusheva, Aleksandra S. Maratkanova, Vyacheslav I. Kozyrev, Pavel S. Kostenetskiy, Hadi M. Salekh
Published in: Vol. 14, No. 2, 2025.
Free access
Although data mining and data processing tools are widespread and have been applied successfully to individual applied problems, the problem of developing a technology for creating such software tools remains unsolved. In the context of a unified MLOps process for creating machine learning technologies, this paper considers the problems that arise in automating and executing distributed computational experiments on a hybrid cloud computing platform. The MLOps platform being developed at HSE University is designed to deploy intelligent services and data analysis software. The platform must manage heterogeneous resources available locally and in the cloud and combine them with the resources of the HSE cHARISMa computing cluster, which is managed with Slurm. This makes it relevant to integrate these resources for conducting computational experiments, implementing pipelines for setting up machine learning models, and solving data processing and analysis problems. The problem has three distinguishing features: computation is treated as an integral part of the technology for creating intelligent services, this technology requires heterogeneous resources, and the computations are executed on a hybrid platform. The paper proposes a solution to the problem of integrating computations and presents the results of testing it on intelligent services. We show that heterogeneous resources can be integrated within a single computational experiment, based on an object model of the experiment that the user can extend and a domain-specific language for specifying it. We also resolve issues of dynamically managing the deployment of intelligent applications and of integrating data processing pipelines, services, and data sets for performing distributed computational experiments.
Keywords: distributed computing experiments, machine learning, cloud technologies, MLOps
Short URL: https://sciup.org/147250999
IDR: 147250999 | DOI: 10.14529/cmse250203