Computational Pipeline for De Novo Recognition of Transcription Factor Binding Sites in Bacterial Genomes

Authors: Mukhin A.M., Oschepkov D.Yu., Lashin S.A.

Journal: Проблемы информатики (Problems of Informatics) @problem-info

Section: Applied Information Technologies. Bioinformatics

Published in issue: 4 (65), 2024.

Free access

The search for transcription factor binding sites (TFBSs) in bacterial genomes is one of the most important steps in studying these genomes and subsequently using them in biotechnology and microbiology. A TFBS is typically 5-20 base pairs long, and each transcription factor can bind to a set of sites with similar sequences. The concept of a motif is used to describe a set of sequences that share substantial (non-random) similarity. That is, in molecular biology a motif is a group (or, depending on the context, a representative of a group) of relatively short nucleotide (or amino acid) sequences that are similar because they perform the same biological function, e.g., binding a single transcription factor. Various bioinformatics approaches exploit this similarity directly to detect motifs de novo in samples of genomic sequences; such detection is possible only if the sample is sufficiently enriched in sequences that share the corresponding similarity. When a bacterial genome is insufficiently annotated, as with a newly sequenced genome, de novo motif detection proves to be the most effective way of finding TFBSs. In this paper, we propose a set of computational motif search pipelines that take bacterial genome data and its primary annotation as input. The pipelines use two different approaches to search for motifs: full-genome search, in which de novo motifs are searched for in the set of promoters of a single genome, and phylogenetic footprinting, in which motifs are searched for among a set of promoters of similar genes and/or operons. They provide the researcher with a comprehensive set of settings for obtaining the most complete annotation of sites, both across the whole genome and, in more detail, within the regulatory region of a selected gene.
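The motif concept described above can be illustrated with a minimal Python sketch. It assumes a set of already aligned, equal-length binding sites and summarizes them as per-position nucleotide counts and a consensus string. The example sites and the function names `position_counts` and `consensus` are hypothetical and are not taken from the pipelines. De novo tools such as BoBro must additionally discover the sites themselves, which this sketch does not attempt.

```python
from collections import Counter


def position_counts(sites):
    """Count nucleotides at each position of aligned, equal-length sites."""
    length = len(sites[0])
    if any(len(site) != length for site in sites):
        raise ValueError("all sites must have the same length")
    return [Counter(site[i] for site in sites) for i in range(length)]


def consensus(sites):
    """Return the most frequent nucleotide at each position.

    Ties are resolved in favor of the first base in the order A, C, G, T.
    """
    counts = position_counts(sites)
    return "".join(max("ACGT", key=lambda base: col[base]) for col in counts)


# Hypothetical sites assumed to be bound by one transcription factor.
sites = ["TTGACA", "TTGACT", "TTTACA", "TTGATA"]
print(consensus(sites))  # TTGACA
```

The count matrix, rather than the consensus alone, is what preserves the spectrum of similar sequences that a single transcription factor can bind.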
The presented pipelines were implemented using the Nextflow platform and scripts in the Python programming language. The following tools are used within the pipelines: BoBro, for de novo motif search in the promoters of a single organism; MP3, which implements de novo motif search by phylogenetic footprinting in a set of promoters; GOST, to identify similar genes and/or operons between two genome assemblies; OperonMapper, to determine the operon structure of the genome; and TomTom, to annotate de novo motifs. We have also developed an indexed metadata database for known bacterial genomes using the embedded SQLite DBMS, which significantly accelerates data retrieval for further calculations.
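As an illustration of how an indexed metadata database in embedded SQLite might look, the following Python sketch uses the standard `sqlite3` module. The table name `genome`, its columns, the index name, and the accession values are all assumptions made for this example; the abstract does not describe the actual schema.

```python
import sqlite3


def build_genome_index(records):
    """Create an in-memory SQLite database of genome metadata.

    The schema here is hypothetical and chosen only for illustration.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE genome ("
        " accession TEXT PRIMARY KEY,"
        " organism TEXT NOT NULL,"
        " fasta_path TEXT NOT NULL)"
    )
    # The index lets lookups by organism avoid a full table scan.
    conn.execute("CREATE INDEX idx_genome_organism ON genome (organism)")
    conn.executemany("INSERT INTO genome VALUES (?, ?, ?)", records)
    conn.commit()
    return conn


def find_by_organism(conn, organism):
    """Return (accession, fasta_path) pairs for one organism, sorted."""
    rows = conn.execute(
        "SELECT accession, fasta_path FROM genome"
        " WHERE organism = ? ORDER BY accession",
        (organism,),
    )
    return rows.fetchall()


# Placeholder records; the accessions and paths are not real identifiers.
records = [
    ("ACC_0002", "Escherichia coli", "genomes/ACC_0002.fna"),
    ("ACC_0001", "Escherichia coli", "genomes/ACC_0001.fna"),
    ("ACC_0003", "Bacillus subtilis", "genomes/ACC_0003.fna"),
]
conn = build_genome_index(records)
hits = find_by_organism(conn, "Escherichia coli")
```

An embedded database of this kind needs no separate server process, which suits a pipeline that runs as a set of independent tasks.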


Nextflow, python, sqlite, jbrowse2

Short URL: https://sciup.org/143184149

IDR: 143184149   |   DOI: 10.24412/2073-0667-2024-4-69-83

Research article