Effective extraction of textual data from document images using transformer architecture of deep neural networks

Vykhodtseva Victoria Aleksandrovna; Popova Galina Vladimirovna; Vais Yuriy Andreevich

doi:10.18287/COJ1744

Journal: Computer Optics (Компьютерная оптика) @computer-optics

Section: Numerical methods and data analysis

Article in issue: No. 2, Vol. 50, 2026.

Free access

In modern digital document management, automating document processing, particularly in accounting, is crucial to improving the efficiency of business processes. However, automated document processing faces specific challenges arising from both the linguistic and the structural characteristics of the data. Traditional text processing methods that rely on classical optical character recognition (OCR) algorithms do not extract data from document images accurately enough, which limits their use in automated accounting systems. These challenges are particularly evident when processing documents with complex structures, specific element placement, and text content. This paper addresses the problem by applying a model based on a transformer neural network architecture, specifically adapted for working with document images. In this study, the transformer model is trained on a dataset of accounting document images with varying element placements and text containing Cyrillic characters. The focus on Cyrillic text is particularly relevant, as research in this area has predominantly concentrated on documents in English or other Latin-based scripts. The article reports the training results, evaluated with specialized performance metrics. At the final stage of training, the confidence loss was 0.156, which indicates that the model effectively minimizes the prediction error. The precision of 0.868 indicates that a relatively high share of the model's predictions are correct. The recall of 0.905 indicates that the model identifies most of the positive examples. The F1 score of 0.886 reflects a good balance between precision and recall. The accuracy of 0.96798 indicates that the model's predictions are highly accurate.
The use of the transformer model significantly improves the accuracy of extracting key information, such as date, number, and organization name, from accounting documents containing Cyrillic text. The findings of this study confirm the potential of this method for implementation in automated accounting systems, contributing to enhanced efficiency and precision in processing accounting documents.
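
The F1 figure reported in the abstract can be cross-checked against the two values it is derived from. The snippet below is a minimal sketch of that check, assuming the reported 0.868 is the model's precision and 0.905 its recall; the helper name `f1_score` is illustrative and is not taken from the paper.

```python
def f1_score(precision: float, recall: float) -> float:
    """Return the harmonic mean of precision and recall."""
    if precision + recall == 0:
        # Avoid division by zero when both inputs are zero.
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Values quoted in the abstract; reading 0.868 as precision is an assumption.
f1 = f1_score(0.868, 0.905)
print(round(f1, 3))  # 0.886
```

The result agrees with the reported F1 of 0.886, which supports reading the 0.868 figure as precision rather than overall accuracy.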

Attention mechanism, deep learning, document intelligence, neural network, optical character recognition, transformer

Short URL: https://sciup.org/140314858

IDR: 140314858 | DOI: 10.18287/COJ1744