Development of a service for automatic extraction of medical concepts from unstructured Russian-language texts
Authors: Ronzhin L.V., Astanin P.A., Rauzina S.E., Yadgarova P.A., Zarubina T.V.
Journal: Siberian Journal of Clinical and Experimental Medicine @cardiotomsk
Section: Digital technologies in medicine and healthcare
Article in issue: No. 2, Vol. 40, 2025.
Free access
Introduction. A significant part of medical data is currently generated and stored in unstructured (textual) form. One way to process unstructured information is named entity recognition (NER). In the classical view, solving the NER problem in medical texts involves identifying objects or concepts that have a specific context related to the actions or events mentioned in the text. The National Unified Terminological System (NUTS) has been under development since 2022, based on international and federal medical thesauri and other sources. It can serve as the term set for problems of this type. At the time of the study, the scientific literature contained no information about tools that solve the NER problem in unstructured Russian-language medical texts.
Aim: To develop a tool for extracting named entities from Russian-language medical texts.
Material and Methods. Named entity recognition is performed using the NUTS as the terminological framework. The preprocessing pipeline includes segmentation of the full text, tokenization and dependency parsing of sentences, and lemmatization and morphological analysis of words. The Annotation tool was evaluated on clinical guidelines. The primary evaluation metric is the ratio of correctly identified terms to the total number of terms extracted by experts.
Results. As part of this study, the Annotation tool for medical texts was developed. It is an automated tool for the extraction and categorization of NUTS terms. The service is hybrid: it combines large language models with rules. The Annotation tool can analyze texts in any language of the Indo-European group using any terminological system. It automatically extracts up to 93% of terms from real unstructured guideline texts. The quality of the service is comparable to that of international NER tools for English-language texts: cTAKES with 91% accuracy and MetaMap with an F1-score of 88%.
Conclusion. The article presents the Annotation tool, a hybrid service for named entity recognition in unstructured medical texts. The service was validated by extracting NUTS terms from current clinical guidelines, with subsequent verification by medical experts. The results demonstrate the promising potential of both this tool and the National Unified Terminological System (NUTS).
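The primary metric described in Material and Methods, the ratio of correctly identified terms to the total number of terms extracted by experts, corresponds to recall against the expert annotation. A minimal sketch of how such a ratio could be computed is given below. The function name `term_recall`, the sample terms, and the matching rule (whitespace trimming plus case-folding) are illustrative assumptions; the abstract does not specify how extracted terms were matched to expert terms.

```python
def term_recall(expert_terms, extracted_terms):
    """Return the share of expert-annotated terms that the tool also found."""

    def normalize(term):
        # Assumption: terms match after collapsing whitespace and case-folding.
        # The abstract does not state the actual matching rule.
        return " ".join(term.split()).casefold()

    gold = {normalize(t) for t in expert_terms}
    found = {normalize(t) for t in extracted_terms}
    if not gold:
        raise ValueError("expert term list is empty")
    # Correctly identified terms are those present in both sets.
    return len(gold & found) / len(gold)


# Hypothetical example data, not taken from the study.
expert = ["Инфаркт миокарда", "аспирин", "гипертензия", "ЭКГ"]
tool = ["инфаркт миокарда", "Аспирин", "экг", "кашель"]
print(term_recall(expert, tool))  # 0.75
```

Terms that the tool returns but the experts did not mark (here, "кашель") do not affect this ratio, so it measures coverage of the expert annotation rather than precision.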
Keywords: NLP, natural language processing, NER, named entity recognition, NUTS, concept, knowledge base, ontology
Short URL: https://sciup.org/149148598
IDR: 149148598 | DOI: 10.29001/2073-8552-2025-40-2-201-210