Historical Documents Classification using BERT: LLM and Historical Domain

Author: Galushko I.N.

Journal: Вестник Пермского университета. История (Perm University Herald. History) @histvestnik

Section: Digital humanities: through algorithms to knowledge

Published in issue: 2 (69), 2025.

Free access

At the present stage of research in Russian history, the question of how to process large collections of historical documents is becoming especially relevant. Archival collections are being actively digitized, but in most cases the resulting corpus is simply posted online and remains unused for years. The reason is that processing an entire collection is difficult: the digitized fonds of a large social institution can contain hundreds of thousands of pages of documentation, and limited time does not allow a researcher to cover all the available documents even with a quick reading. This problem could be at least partially solved by using LLMs for annotation or for optimizing text search. However, archival specialists are only beginning to work with natural language processing methods. The main request of the professional community is to study how artificial intelligence and machine learning models behave on texts from the historical domain. This article is a preliminary study of how modern LLMs interact with historical texts. For the analysis, we chose one of the most popular models, BERT, and one of the most common NLP tasks, classification.

Text classification, political history, artificial intelligence, attention mechanism analysis, machine learning, BERT, NLP

Short URL: https://sciup.org/147250818

IDR: 147250818 | UDC: 930.23 | DOI: 10.17072/2219-3111-2025-2-147-158