Implementation of searching for the most frequent DNA sequences using the Kokkos library

Автор: Kozlov M., Panova E., Meyerov I.

Журнал: Проблемы информатики @problem-info

Рубрика: Прикладные информационные технологии

Статья в выпуске: 2 (63), 2024 года.

Бесплатный доступ

Nowadays, the wide variety of existing architectures raises the problem of developing universal approaches to programming. Various frameworks enable single-source code creation for multiple devices, for example, CPU, GPU, FPGA. Such frameworks include OpenCL, OpenACC, Kokkos, Alpaka and others. However, the problem of efficiency and performance portability remains relevant. It is not always possible to create one code that works efficiently on different devices because of their specific architectures. This article discusses performance aspects in relation to the Kokkos library, a widely used framework for creating cross-platform code. As a benchmark, we consider a bioinformatics problem to find the most frequent DNA sequences of certain length. It is assumed that important genetic information can be encrypted in such sequences. DNA sequence can be represented as a string consisting of four characters “A”, “C”, “G”, “T”, which denote corresponding nucleobases. Therefore, the problem reduces to counting fixed-length patterns in DNA and can be solved using existing string matching algorithms. Faro and Lecroq (2013) reviewed and classified exact string matching algorithms and experimentally evaluated them on different kinds of texts. Hakak et al. (2019) showed the latest advancements in the field of string matching algorithms and designated modern trends and challenges. They analyzed various classes of algorithms and drew conclusions about the limitations and effectiveness of different string matching algorithms for various applications. In this article, we have chosen two algorithms for consideration: the well-known Rabin- Karp algorithm and the Hash3 algorithm from the Hash# family [Lecroq 2007]. The Hash3 algorithm is one of the most effective algorithms for short-length patterns of approximately 8 to 128 characters long. Both these algorithms are based on hashing and are well applicable for genome analysis. For verification and comparison, we also consider a simple naive algorithm based on sequential pattern matching. The naive algorithm consists of character-by-character comparison of all fixed-length patterns. This algorithm is not effective enough, but has great potential for parallelization. We received an acceleration of up to 35 times when ported the parallel naive algorithm from CPU to GPU. The Rabin- Karp algorithm allows us to eliminate character-by-character comparisons effectively using hashing and shows better efficiency compared to the naive algorithm on both CPU and GPU. Our cross-platform parallel implementation of the Rabin-Karp algorithm is approximately 1.25 times faster than the naive algorithm on CPU and 2 times faster on GPU. The Hash3 algorithm cuts off character-by-character comparisons extremely efficiently. Because of this, the Hash3 algorithm is an order of magnitude faster than the naive algorithm. Due to the almost absence of character-by-character comparisons, the algorithm is memory bound and has less potential for parallelization. The Hash3 algorithm was accelerated by 7 times on GPU relative to CPU. We implement these algorithms for CPU and GPU using OpenMP, Cuda and Kokkos technologies. We demonstrate that when using Kokkos with a naive algorithm, the performance loss does not exceed 10 % relative to the OpenMP version. Losses are caused by the compiler making more efficient use of SIMD calculations in the OpenMP implementation when matching patterns. There is no performance loss for the Rabin-Karp and Hash3 algorithms when porting the OpenMP version to Kokkos. Speedup of all algorithms is about 14 times on 16 physical cores. It is worth noting that the Hash3 algorithm showed a noticeable improvement on the CPU when using hyper-threading, unlike other algorithms under consideration. This can be explained by more efficient memory management. Speedup on 32 threads and 16 physical cores for the naive algorithm and the Rabin-Karp algorithm is 16-17 times, while for the Hash3 algorithm it is 25 times. Next, we run the developed code on the GPU and show that the Kokkos version of the Rabin-Karp algorithm loses to the Cuda version on the GPU by no more than 10 %. At the same time, the Kokkos versions of the naive and Rabin-Karp algorithms outperform our Cuda baseline version by 10-20 %. The authors did not set themselves the goal of optimizing the Cuda code. We believe that it is possible to optimize the Cuda code to match the performance of the Kokkos version. However, it is noteworthy that sometimes the baseline Kokkos version runs faster than the baseline Cuda version. Overall, we demonstrate that in many cases the Kokkos version works as well as native OpenMP or Cuda code. In the worst case, the performance loss was no more than 10 %. We believe that paying this price is reasonable in order to run a single code on different devices.

Еще

Kokkos, single-source programming, cross-platform software, heterogeneous computing, program performance, string matching algorithms, bioinformatics

Короткий адрес: https://sciup.org/143183458

IDR: 143183458   |   DOI: 10.24412/2073-0667-2024-2-58-71

Статья научная