Hybrid method of classification of text data with specialized terminology
Authors: Serova V.S., Hollay A.V., Bunova E.V.
Section: Informatics and Computer Engineering
Published in: Issue 3, Vol. 25, 2025.
Free access
With the exponential growth of text information, especially in domain-specific areas (technical, medical, legal), automatic classification of texts saturated with highly specialized terminology has become critically important. Existing approaches, including transformer models such as BERT, often lose accuracy on rare or domain-specific vocabulary because they are trained on general-purpose corpora. The aim of the study is to develop a hybrid method, Combined Neural BERT (CNB), that achieves maximum classification accuracy (100 %) on texts with specialized terminology by combining the strengths of contextual language models, lexical-statistical methods, and visualization tools.
Materials and methods. The proposed CNB method integrates three key components: 1) BERT (or its derivatives), which generates deep contextual embeddings that capture semantics and word order; 2) fully connected neural networks (FCNN), which act as a classifier over BERT features and/or process lexical-statistical features; 3) the Word Cloud method and TF-IDF, which extract and visualize key domain terms, form a feature dictionary, and improve interpretability. The architecture of the method includes the following stages: text preprocessing (normalization, cleaning); parallel feature extraction (BERT contextual embeddings and TF-IDF vectors); merging of the feature spaces; classification with the FCNN; and interactive tuning based on Word Cloud analysis.
Results. The hybrid CNB approach was tested on a real corpus of 10,000 requests from residents of the Chelyabinsk region (7 thematic categories) using 70 key terms and 150 stop words. The method reached 100 % classification accuracy after three training iterations (total time of 90 minutes).
Key benefits: higher accuracy, because lexical-statistical features compensate for BERT's weaknesses in specialized domains; improved interpretability through visualization of key terms with the “Word Cloud”; efficient processing of large volumes of specialized texts.
Conclusion. The developed hybrid CNB method proved highly effective for classifying texts with highly specialized terminology. It is a powerful tool for analyzing domain-specific text collections (legal documents, technical documentation, medical reports, etc.) as data volumes continue to grow. Prospects include adapting the method to other domains and optimizing computational efficiency.
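The feature-extraction and merging stages described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names (`tokenize`, `fit_idf`, `tfidf_vector`, `fuse_features`), the tiny stop-word list, the three-term vocabulary, and the example texts are all hypothetical, the smoothed IDF formula is one common variant chosen here as an assumption, and the BERT embedding is replaced by a placeholder vector so that the example runs with the standard library alone. The FCNN classifier that would consume the fused vector is not shown.

```python
import math
import re
from collections import Counter

# Illustrative stop-word list only; the study reports 150 stop words.
STOP_WORDS = {"the", "a", "of", "in", "on", "is", "to", "and"}


def tokenize(text):
    """Lowercase, keep runs of letters (any alphabet), drop stop words."""
    tokens = re.findall(r"[^\W\d_]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]


def fit_idf(docs, vocabulary):
    """Smoothed inverse document frequency for a fixed key-term vocabulary."""
    n = len(docs)
    token_sets = [set(tokenize(d)) for d in docs]
    return {
        term: math.log((1 + n) / (1 + sum(term in s for s in token_sets))) + 1.0
        for term in vocabulary
    }


def tfidf_vector(text, vocabulary, idf):
    """L2-normalized TF-IDF vector over the key-term vocabulary."""
    counts = Counter(tokenize(text))
    raw = [counts[term] * idf[term] for term in vocabulary]
    norm = math.sqrt(sum(x * x for x in raw))
    return [x / norm for x in raw] if norm else raw


def fuse_features(contextual_embedding, lexical_vector):
    """Merge the two feature spaces by simple concatenation."""
    return list(contextual_embedding) + list(lexical_vector)


# Hypothetical example texts standing in for residents' requests.
docs = [
    "The heating in the building is broken",
    "Road repair on the main street",
    "No heating and no hot water",
]
vocabulary = ["heating", "road", "water"]  # stand-in for the 70 key terms

idf = fit_idf(docs, vocabulary)
lexical = tfidf_vector(docs[2], vocabulary, idf)

# Placeholder for a BERT sentence embedding (assumption: 4 dimensions here;
# a real BERT-base embedding has 768).
stand_in_embedding = [0.0] * 4
fused = fuse_features(stand_in_embedding, lexical)
```

In this sketch the rarer term "water" receives a higher weight than "heating", which appears in two of the three texts; this is the kind of signal the lexical-statistical branch contributes alongside the contextual embedding. Concatenation is only one way to merge feature spaces and is assumed here for simplicity.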
Keywords: classification, BERT, FCNN, hybrid models, specialized terminology, word cloud, semantic analysis
Short URL: https://sciup.org/147251613
IDR: 147251613 | DOI: 10.14529/ctcr250304