Associative computing implementation library cuSTAR: data representation for bioinformatics problems

Free access

Over the past few years, genome processing has become a widely sought-after task. Both medical laboratories (from PCR tests to genetic passports) and research teams are engaged in various kinds of processing. Both groups handle large amounts of data, owing either to the number of samples or to the length of each sample: from tens of thousands to several billion nucleotides. A large share of the computation consists of searching for individual nucleotides or nucleotide sequences in a longer sequence or in a large set of sequences, so associative parallel computing is a natural fit. However, associative architectures are not available on the computer hardware market, unlike widely available graphics accelerators. The cuSTAR library was designed to implement the STAR-machine model of associative computing on graphics accelerators. In this paper, we propose several methods of organizing genome data so that associative algorithms can be used to solve various genome-processing tasks. We first briefly recall the associative model of the STAR machine and its cuSTAR implementation. Both the cuSTAR library and the STAR-machine model use three data types for associative processing. The Table type stores data as a binary table. The Slice type is used to access a bit column, and the Word type is used to access a bit string. Data processing is performed mainly on bit columns, so the representation of data in cuSTAR differs fundamentally from the usual one. Usually, a sequence of nucleotides is represented by an array of characters. It can be considered a binary table in which each row specifies one character; that is, the data is stored row by row. In cuSTAR, by contrast, a variable of type Table is stored by columns.
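The difference between row-wise and column-wise storage can be illustrated with a minimal Python sketch. It is not the cuSTAR API: the function names `table_from_rows`, `get_slice`, and `get_word` are hypothetical, and each bit column is modeled as a Python integer whose bit i holds the value in row i.

```python
def table_from_rows(rows):
    """Store a binary table column by column (hypothetical helper).

    rows: list of equal-length bit strings, one per table row.
    Returns a list of ints; in column j, bit i is the value in row i.
    """
    width = len(rows[0])
    columns = [0] * width
    for i, row in enumerate(rows):
        for j, bit in enumerate(row):
            if bit == "1":
                columns[j] |= 1 << i
    return columns


def get_slice(columns, j):
    """Return bit column j; cheap, because columns are stored directly."""
    return columns[j]


def get_word(columns, i):
    """Return row i as a bit string; must gather one bit from every column."""
    return "".join("1" if (col >> i) & 1 else "0" for col in columns)


rows = ["001", "011", "101"]
table = table_from_rows(rows)
print(get_word(table, 1))        # row 1 of the table
print(bin(get_slice(table, 2)))  # last bit column, one bit per row
```

In this layout, reading a column is a single lookup, while reading a row touches every column, which mirrors why column-oriented operations dominate in the model.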
The alphabet of nucleotides consists of the symbols A (adenine), C (cytosine), G (guanine) and T (thymine). In addition, the “-” symbol is often used in the data to indicate possible gaps in reading, insertions or deletions in the nucleotide sequence. Thus, four or five characters are used, depending on the task. We propose two ways to encode a sequence of nucleotides. The first method is optimized for memory usage. The second method is optimized for the time needed to search for a nucleotide in the sequence. The memory-optimized method uses the following encoding: “000” for the “-” symbol, “001” for adenine, “011” for cytosine, “101” for guanine, “111” for thymine. The time-optimized method uses the following encoding: “1000” for adenine, “0100” for cytosine, “0010” for guanine, “0001” for thymine. It uses 4 bits instead of 3, but it replaces the task of searching for a word in the table with a less time-consuming one: to find all occurrences of a nucleotide in the sequence, one only needs to determine the position of the “1” in the code of this nucleotide and read the corresponding bit column. Both proposed encodings are more compact than the standard representation as an array of characters. The time-optimized method makes it possible to search for nucleotides in a sequence an order of magnitude faster than the corresponding procedure for the memory-optimized method. However, the memory-optimized method is preferable if the nucleotide sequence is represented as a graph. In this case, the de Bruijn graph is constructed from the original sequence of nucleotides in a trivial way, whereas with symbolic encoding of nucleotides this is a time-consuming and memory-consuming task. When using cuSTAR, it is easy to construct a de Bruijn graph from a sequence of nucleotides for any parameter k. The graph is given by a list of edges, which is one of the standard representations for associative processing.
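A minimal Python sketch of the two encodings and the corresponding searches follows. The code tables are taken from the text above; everything else (the function names, and the modeling of each bit column as a Python integer with bit i for sequence position i) is an assumption made for illustration and is not the cuSTAR API.

```python
# Code tables as given in the text.
MEMORY_CODE = {"-": "000", "A": "001", "C": "011", "G": "101", "T": "111"}
TIME_CODE = {"A": "1000", "C": "0100", "G": "0010", "T": "0001"}


def encode_columns(seq, code):
    """Encode seq column-wise: bit i of column j is bit j of seq[i]'s code."""
    width = len(next(iter(code.values())))
    columns = [0] * width
    for i, symbol in enumerate(seq):
        for j, bit in enumerate(code[symbol]):
            if bit == "1":
                columns[j] |= 1 << i
    return columns


def find_time_optimized(columns, nucleotide):
    """One-hot code: the answer is simply the column holding the '1'."""
    return columns[TIME_CODE[nucleotide].index("1")]


def find_memory_optimized(columns, symbol, length):
    """3-bit code: every column must be compared against the pattern."""
    full = (1 << length) - 1
    match = full
    for j, bit in enumerate(MEMORY_CODE[symbol]):
        match &= columns[j] if bit == "1" else (~columns[j] & full)
    return match


def positions(mask):
    """List the sequence positions whose bit is set in mask."""
    return [i for i in range(mask.bit_length()) if (mask >> i) & 1]


seq = "ACGTAC"
print(positions(find_time_optimized(encode_columns(seq, TIME_CODE), "A")))
print(positions(find_memory_optimized(encode_columns(seq, MEMORY_CODE), "C", len(seq))))
```

The time-optimized search reads a single column, while the memory-optimized search combines all three columns, which is the trade-off the two encodings make.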
Note that by defining the graph as a list of edges, we avoid problems associated with repeated arcs. When reading the sequence, a table GEN of size 3l is formed, where l is the length of the input sequence. For a graph given by a list of arcs, we form tables LEFT and RIGHT of size 3k(l - k). The table LEFT is obtained by copying the columns of GEN k times into the corresponding columns with an upward shift. In turn, the table RIGHT is obtained by copying the table LEFT with an upward shift of one row. Copying of all tables is performed in parallel. Since genome processing involves multiple searches over a large amount of data, the development of associative algorithms for this area is relevant. The applied value of the work lies in the possibility of executing these algorithms on graphics accelerators, which are widespread equipment found in systems ranging from personal computers to clusters.
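The construction of LEFT and RIGHT can be sketched in Python as follows. This is one reading of the description above, not the cuSTAR implementation: the function names are hypothetical, each bit column is modeled as a Python integer (bit i for row i), an upward shift by s rows is modeled as a right shift by s bits, and the copies run sequentially here rather than in parallel. The memory-optimized 3-bit code is assumed.

```python
MEMORY_CODE = {"-": "000", "A": "001", "C": "011", "G": "101", "T": "111"}
DECODE = {bits: symbol for symbol, bits in MEMORY_CODE.items()}


def build_gen(seq):
    """GEN: l rows by 3 bit columns, stored as 3 integers."""
    columns = [0, 0, 0]
    for i, symbol in enumerate(seq):
        for j, bit in enumerate(MEMORY_CODE[symbol]):
            if bit == "1":
                columns[j] |= 1 << i
    return columns


def build_left_right(gen, length, k):
    """LEFT and RIGHT: (l - k) rows by 3k bit columns each."""
    edges = length - k
    mask = (1 << edges) - 1
    # k shifted copies of GEN's columns: row i then spells the k-mer at i.
    kmers = [col >> shift for shift in range(k) for col in gen]
    left = [col & mask for col in kmers]
    # RIGHT is the same table moved up by one row.
    right = [(col >> 1) & mask for col in kmers]
    return left, right


def decode_row(columns, i):
    """Read row i back as a string of nucleotides (for inspection only)."""
    bits = "".join("1" if (c >> i) & 1 else "0" for c in columns)
    return "".join(DECODE[bits[p:p + 3]] for p in range(0, len(bits), 3))


def edge_list(seq, k):
    """Edges of the graph as (LEFT row, RIGHT row) pairs of k-mers."""
    left, right = build_left_right(build_gen(seq), len(seq), k)
    return [(decode_row(left, i), decode_row(right, i))
            for i in range(len(seq) - k)]


print(edge_list("ACGTA", 2))
```

Row i of LEFT holds the k-mer starting at position i and row i of RIGHT holds the k-mer starting at position i + 1, so each row pair is one arc of the graph.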


GPU, CUDA

Short URL: https://sciup.org/143180837

IDR: 143180837   |   DOI: 10.24412/2073-0667-2023-1-60-68

Research article