From Data to Diagnosis: A Review of Natural Language Processing (NLP) Methods for Automatic Medical Text Summarisation

Open access

Reducing the clinical documentation burden with natural language processing (NLP) technologies is an important innovation in healthcare, intended to cope with ever-growing volumes of unstructured text. This review maps the methodological landscape of research on automatic medical text summarisation, with particular attention to clinical documentation and discharge summaries. Following accepted methodological guidelines for conducting reviews, the literature for 2017–2025 was searched in PubMed, IEEE Xplore, ACM Digital Library, and Web of Science using predefined keywords. Of 156 retrieved articles, 30 met the inclusion criteria concerning automated NLP-based summarisation of medical texts. The results show that earlier studies were dominated by extractive methods, whereas transformer models such as BERT, BART, and GPT variants now demonstrate higher effectiveness. Evaluation is conducted mainly with ROUGE metrics, while clinical validation is rarely performed. The main challenges concern data scarcity, the risk of abstractive models generating unreliable content, and insufficient integration of these solutions into clinical workflows. Future research should prioritise human-in-the-loop systems, standardised clinical benchmarks, and multilingual corpora to ensure the safe and effective adoption of these technologies in clinical practice.


Natural language processing, automatic medical text summarisation, clinical documentation, discharge summaries, transformer models, electronic health records.

Short address: https://sciup.org/14135078

IDR: 14135078   |   DOI: 10.47813/2782-5280-2025-4-4-2048-2055



Narrative clinical text exists in many forms, including progress notes, discharge summaries, radiology reports, pathology findings, and consultation letters, and healthcare systems around the globe produce copious quantities of it [1]. This unstructured documentation contains crucial patient data required for clinical decisions, yet its free-text format is a major obstacle to timely access and understanding [2]. The adoption of electronic health records has driven exponential growth in textual data that clinicians must process while attending to patients [3]. Physicians and nurses in high-traffic clinical settings experience information overload, searching for essential information among voluminous textual records, often under severe time restrictions and at the cost of a comprehensive review [4]. High-quality summarisation tools could substantially reduce this cognitive load while preserving critical clinical information, improving both clinician wellbeing and patient safety [5].

NLP provides computational methods for transforming unstructured text into structured or semi-structured forms that can be understood quickly [6]. These methods have proven effective in many areas, such as news summarisation, legal document processing, and scientific literature review [7]. Summarisation methods, specifically, facilitate the extraction or generation of brief summaries of lengthy documents. This may aid diagnosis, because observations across multiple sources are condensed into actionable summaries; discharge, because clinical histories are summarised; and hand-off communication between care teams during shift changes or transfers. The potential efficiency gains are significant, because clinicians currently spend substantial portions of their working time on documentation rather than on direct patient care [8].

Nevertheless, the medical domain poses issues not faced by general-domain summarisation systems: high-stakes decision situations where mistakes can directly harm patients; highly specialised terminology covering diseases, pharmacology, and procedures; intensive use of abbreviations; and strict privacy and regulatory policies that tightly control health information [9]. Clinical summarisation must therefore be factually faithful rather than fabricated, retain important details such as medication dosages and diagnoses, be interpretable to clinical users, and fit seamlessly into existing workflows [10]. These demands go far beyond those of general-domain summarisation and require specialised methods.

Although research interest in medical NLP has grown in recent years, the synthesis of NLP-based medical summarisation methods across the computational linguistics, medical informatics, and clinical specialty literatures has remained incomplete [11]. This scoping review responds to that gap by systematically mapping the input text types commonly targeted by summarisation, the spectrum of approaches from extractive to abstractive, evaluation practices and their suitability to clinical settings, and the deployment issues that have hindered translation from research to practice. The review particularly highlights clinical documentation and discharge summaries, among the most important hand-off documents, which require high-fidelity summarisation to support continuity of care [12]. By synthesising the existing evidence along these dimensions, this review reveals research gaps and methodological trends, as well as future opportunities to improve clinically applicable summarisation systems for safe and effective healthcare provision.

METHODOLOGY

Review Framework

The review follows a scoping review paradigm, which is suitable for mapping rapidly changing areas and detecting knowledge gaps rather than answering narrowly scoped clinical questions [13]. Scoping reviews are suited to analysing emerging fields whose research heterogeneity precludes meta-analysis, unlike systematic reviews, which are designed to synthesise evidence on particular interventions. The scoping methodology makes it possible to map the literature of the field, identify its major concepts and research gaps, and elucidate its definitions and boundaries. Given the fast pace of technological development in NLP and the variety of methods employed for medical summarisation, this framework is well placed to describe the state of the field and to identify priorities for future research.

Search Strategy

Four large literature databases were searched systematically: PubMed for biomedical literature, IEEE Xplore for a computational and engineering perspective, ACM Digital Library for computer science contributions, and Web of Science for interdisciplinary coverage. The search covered articles published since January 2017. This period was chosen to span the transformer revolution in NLP, from the inception of attention mechanisms to the recent advances in large language models that have reshaped the field. The query combined domain and technical terms in the following format: ((medical text summarisation) OR (clinical note summarisation) OR (discharge summary generation) OR (radiology report summarisation)) AND ((natural language processing) OR (NLP) OR (transformer) OR (deep learning) OR (BERT) OR (GPT)). Boolean operators and syntax variations were adapted to each platform to maximise retrieval.

Eligibility Criteria

The inclusion criteria were: original research articles on automated NLP-based summarisation of healthcare or biomedical text with clear methodological descriptions [14]; articles involving clinical documentation, such as discharge summaries, radiology reports, and progress notes, or biomedical literature with direct healthcare workflow relevance; and articles clearly applying an extractive, abstractive, or hybrid summarisation methodology with technical description sufficient for performance assessment. Publications were excluded if they were not in English, involved no automation in their workflow, produced no summarisation output, were not based in the health domain, were opinion pieces or editorials, or provided an inadequate methodological description. Published reviews and surveys were used for reference mining and contextualisation but were not included in the main analysis, to prevent duplication of results.

Selection Process and Data Extraction

The preliminary database search produced 156 articles. After removal of duplicates across databases, 112 articles remained for screening. Title-and-abstract screening against the eligibility criteria filtered out 44 articles that were evidently out of scope, and 68 articles proceeded to full-text analysis [15]. Full-text evaluation against all eligibility criteria led to the removal of 38 more articles, mostly due to insufficient methodological description, the absence of a summarisation output, or a non-clinical text focus. The final synthesis comprised the 30 articles that qualified. Information was systematically extracted on input text type and clinical setting, model architecture and summarisation strategy, datasets and corpora used, evaluation measures and methods, main performance findings, deployment and integration status, and identified limitations [16]. The extracted data were arranged into thematic categories to allow synthesis across studies and identification of patterns and gaps.

FINDINGS

Input Text Types and Clinical Contexts

The reviewed studies involved three major text types with different features and profiles of clinical applicability. Biomedical literature summarisation, accounting for about one-third of the included studies, targeted the condensation of journal articles, abstracts, and clinical trial descriptions, mostly drawn from the PubMed and ClinicalTrials.gov repositories. These systems are useful for research synthesis and evidence review, but little clinical workflow applicability has been demonstrated because they handle published literature rather than patient-specific documentation [17]. The most significant type was clinical documentation summarisation, which tackled progress notes, physician-nurse communication, and radiology reports created in the context of active patient care [18]. These studies have considerably greater clinical usability but must contend with heterogeneous input formats across institutions, temporal dynamics as patient conditions change, and differing documentation practices across specialties and individual clinicians [19].

Although the discharge summary is a key hand-off document central to care transitions, surprisingly little research focused specifically on its generation. Very few studies aimed at automated production of detailed discharge summaries from entire hospitalisation records, including admission notes, day-to-day progress notes, procedures, and laboratory findings. This is a substantial gap given the importance of the discharge summary for continuity of care between hospital and post-acute settings, communication with primary care providers, and patient comprehension of their hospital course and post-discharge needs [20]. Multi-document summarisation, which combines several input documents such as admission notes, serial progress documentation, and discharge instructions, is far less developed than single-document modelling, even though it better reflects the real-world clinical task of integrating scattered information into a consistent narrative [21].

Summarisation Approaches and Model Architectures

The summarisation methods fall into distinct groups, each with characteristic strengths, weaknesses, and clinical implications [22]. Extractive methods select important sentences or phrases from the source text as the summary, maintaining factual fidelity because no new text is created. However, extractive techniques commonly produce summaries with an unnatural flow, may repeat information appearing in more than one of the selected sentences, and cannot compress or paraphrase information to enhance readability. Abstractive methods use language generation models to create new text conveying the source meaning, providing more natural and flexible output that reads more like human-written summaries. But abstractive methods carry a high risk of hallucination, in which models generate information not found in the source documents, a critical issue in a clinical environment where an invented medication dosage or diagnosis could harm the patient.
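The extractive paradigm described above can be illustrated with a minimal frequency-based sentence scorer. This is a deliberately simple sketch of the general idea, not an implementation from any of the reviewed studies:

```python
import re
from collections import Counter

def extractive_summary(text: str, n_sentences: int = 2) -> str:
    """Select the n highest-scoring sentences by mean word frequency.

    Sentences are copied verbatim from the source, so no content is
    fabricated; the trade-off is limited coherence and no compression.
    """
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    freq = Counter(re.findall(r"[a-z']+", text.lower()))

    def score(sentence: str) -> float:
        tokens = re.findall(r"[a-z']+", sentence.lower())
        return sum(freq[t] for t in tokens) / max(len(tokens), 1)

    ranked = sorted(range(len(sentences)), key=lambda i: score(sentences[i]),
                    reverse=True)
    chosen = sorted(ranked[:n_sentences])  # restore original document order
    return " ".join(sentences[i] for i in chosen)
```

Because the selected sentences are returned in their original order, the output reads as a subsequence of the source, which is exactly the faithfulness property that makes extractive methods attractive in clinical settings.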

Hybrid methods strive to combine the advantages of extraction and abstraction: salient sentences are first detected using extractive techniques and then reformulated by an abstractive model [9]. Domain-aware methods are a further specialisation that use clinical ontologies, medical entity extraction, and healthcare knowledge graphs to steer summarisation towards clinically relevant content and guarantee the preservation of important clinical concepts. Large language models represent the technological frontier: recent models, including GPT variants and instruction-tuned ones, have shown good performance on clinical summarisation tasks but raise major concerns regarding factual reliability, computational cost, and clinical validation criteria [20]. The area exhibits a clear temporal progression from statistical extractive models to neural sequence-to-sequence designs to transformer-based models such as BERT, BART, Pegasus, and T5, which offer better contextual understanding and more coherent text generation [16]. Nevertheless, each new generation adds complexity, computing needs, and validation difficulties that must be weighed against the performance improvements.

Datasets and Evaluation Methodologies

The availability of datasets strongly shapes research development and also introduces systematic biases into model building. Most studies on clinical summarisation use the MIMIC-III and MIMIC-IV databases, which offer de-identified intensive care data from Beth Israel Deaconess Medical Center with rich clinical narratives supporting both extractive and abstractive model building [19]. ClinicalTrials.gov provides structured information on clinical trials commonly used for summarising study objectives, eligibility criteria, and outcome measures [6]. Radiology report corpora address domain-specific summarisation issues in diagnostic imaging interpretation. PubMed abstracts support biomedical literature summarisation bridging research findings and clinical practice. Nevertheless, the concentration of available datasets in US academic medical centres raises significant concerns about generalisability to healthcare systems with different documentation practices, terminology conventions, and clinical workflows.

Assessment methods combine automatic measures with limited human evaluation, leaving major gaps in the understanding of clinical utility [23]. ROUGE variants, which characterise word overlap between generated and reference summaries, with ROUGE-1 measuring unigram overlap, ROUGE-2 bigram overlap, and ROUGE-L the longest common subsequence, remain the most common performance measures enabling model comparison [24]. BLEU scores are used to evaluate fluency and grammatical accuracy, whereas BERTScore evaluates semantic similarity using transformer embeddings, offering more insight into whether meaning is preserved beyond surface lexical matching [25]. Research using state-of-the-art architectures such as BART, Pegasus, or GPT models usually reports a battery of metrics conveying both lexical and contextual quality to ensure thorough performance characterisation [17]. Review by staff clinicians, focusing on the thoroughness of clinical content, accuracy of medical facts, and clinical usefulness, is a much more reliable quality measure consistent with real clinical needs, but it is expensive, subject to inter-rater variability, and therefore rarely seen in the literature [26]. Faithfulness measures that explicitly identify hallucinated content absent from the source texts are emerging as the next priority given healthcare safety standards, but standardised approaches and validated measurement instruments are still under development [22].
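As a concrete illustration of the lexical-overlap metrics discussed above, ROUGE-1 F1 can be computed from unigram counts alone. The following from-scratch sketch captures the idea; published work typically uses a reference implementation such as Google's `rouge-score` package:

```python
from collections import Counter

def rouge1_f1(candidate: str, reference: str) -> float:
    """ROUGE-1 F1: harmonic mean of unigram precision and recall.

    Counter intersection (&) clips counts so that a word repeated in
    the candidate is not credited more times than it appears in the
    reference.
    """
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())  # clipped unigram matches
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)
```

The example makes the review's critique tangible: a candidate summary that omits a safety-critical phrase can still share most unigrams with the reference and score highly, which is why lexical overlap alone cannot certify clinical validity.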

Deployment Status and Integration Challenges

Although much research has demonstrated technical feasibility, clinical deployment of summarisation systems remains almost entirely absent.

Most of the summarisation systems in the reviewed literature are research prototypes created for experimentation and benchmarking rather than implementation in clinical practice [9]. Moving such systems out of controlled research environments into live electronic health record systems is a daunting technical, regulatory, and cultural challenge that most research has not tackled. Workflow integration is one of the most crucial issues: to offer value, summarisation tools must be harmonised with clinical processes such as hand-offs between shifts, discharge documentation, and reporting workflows in radiology and other diagnostic services [4]. The bulk of the literature centres on intrinsic summary quality measures rather than effects on workflow efficiency, user acceptance, or patient outcomes, which greatly restricts knowledge of practical clinical applicability.

Another significant impediment to clinical adoption is trust and transparency [27]. Healthcare professionals reasonably require intelligible and auditable outputs with traceable reasoning that allow the content of summaries to be verified against the source documentation [18]. Black-box models that provide no explanation of their outputs meet rapid opposition in clinical settings, where accountability for patient care decisions must be addressed. Regulatory matters, such as data protection under healthcare regulations and compliance with institutional privacy policies, are vital deployment requirements that further complicate implementation [22]. Concerns about automation risk in high-stakes clinical scenarios are driving growing interest in hybrid deployment models in which AI-generated initial summaries undergo mandatory clinician review and refinement before clinical use [25]. It is surprising how under-explored such human-in-the-loop models, which conceptualise AI as an assistant rather than a substitute, remain in healthcare, despite their potential to balance the efficiency benefits of automation with the safety guarantees of human oversight.

DISCUSSION

This scoping review exposes a research area characterised by rapid technological advancement taking place alongside persistent impediments to clinical translation. The evolution from statistical extractive methods based on term frequency, through neural sequence-to-sequence models, to modern transformer-based and LLM systems represents an impressive gain in computational capacity for handling clinical text [12]. Modern systems can now produce coherent, contextually relevant summaries of complex clinical histories containing a variety of note types and large temporal gaps that were previously intractable to automated processing [7]. However, despite such technical success, the gap between good research results on benchmark datasets and proven clinical results in real healthcare contexts is significant and alarming.

Evaluation methodology has emerged as a major limitation on the meaningful assessment of progress. The intensive use of ROUGE and other lexical overlap measures is insufficient to gauge clinical validity, because word matching between generated and reference summaries is a poor predictor of whether the summaries actually preserve safety-critical information, facilitate clinical decision-making, or offer actionable patient care information [15]. A summary may achieve a good ROUGE score while concealing important medication allergies or misreporting diagnostic results. The lack of standardised clinical benchmarks with expert-annotated gold standards significantly undermines comparison across studies and assessment of the field's progress. The use of institution-specific datasets with different documentation styles, terminology conventions, and clinical settings prevents the creation of generalisable performance baselines applicable across healthcare systems. Future studies should focus on establishing discharge-specific and clinical hand-off benchmarks with clinician-annotated gold standards that assess completeness of critical clinical content, accuracy of medical facts, and actionability for continuity of care, instead of relying solely on lexical similarity metrics.

Safety and faithfulness are unconditionally significant requirements for clinical deployment that existing systems do not sufficiently meet. Abstractive models, and LLMs in particular, pose significant risks of hallucinating clinical text: fabricating medication dosages, inventing diagnostic results, or omitting vital safety details such as allergies or contraindications [5]. In healthcare settings with serious repercussions for patient safety and possible legal consequences, none of the existing summarisation systems meets the reliability criteria necessary to run unsupervised in clinical practice [8]. Faithfulness verification mechanisms, the ability to detect systematic errors, and transparent reasoning that allows clinicians to verify outputs should become compulsory features of clinical summarisation systems rather than optional research additions. Positioning AI systems as draft-generating assistants whose summaries require clinician review offers a viable short-term way to balance automation efficiency and information safety while keeping humans in control of the information that drives patient care [28].
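One simple instance of the faithfulness checks called for above is verifying that every numeric dosage mentioned in a generated summary also occurs in the source note. The sketch below is illustrative only, not a validated clinical method; the dose pattern and unit list are assumptions for the example:

```python
import re

# Illustrative pattern for "number + unit" dose mentions; real systems
# would need a far richer grammar (ranges, frequencies, routes, etc.).
DOSE_PATTERN = re.compile(r"\b(\d+(?:\.\d+)?)\s*(mg|mcg|g|ml|units?)\b",
                          re.IGNORECASE)

def unsupported_doses(source: str, summary: str) -> list:
    """Return dose mentions in the summary that never appear in the source.

    A non-empty result flags potentially hallucinated dosages for
    mandatory clinician review. An empty result is NOT proof of
    faithfulness, only the absence of this one error type.
    """
    def doses(text: str) -> set:
        return {f"{m.group(1)} {m.group(2).lower()}"
                for m in DOSE_PATTERN.finditer(text)}
    return sorted(doses(summary) - doses(source))
```

A check of this kind fits naturally into the human-in-the-loop deployment pattern: it does not decide anything itself, it only routes suspicious summaries to a clinician.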

Dataset limitations also restrict model development and generalisation to varied clinical situations. The predominance of the MIMIC databases in clinical summarisation studies produces models that may apply only to US intensive care documentation standards and may not transfer to other clinical specialties, care units, or healthcare systems. Models absorb the documentation structure, abbreviations, specialty terminology, and clinical workflow conventions of the English-language US academic medical centre corpora on which they are trained, and this influence is substantial, implying that the models could prove ineffective once implemented in healthcare systems with different practices [26]. Data scarcity stems from privacy restrictions hindering the sharing of clinical data, de-identification mechanisms that can eliminate contextually relevant information, and limited funding for the infrastructure needed to develop clinical NLP datasets. Diversifying datasets across institutions, clinical specialties, languages, and care systems beyond a single jurisdiction is a core requirement for creating clinically generalisable summarisation systems able to offer value beyond narrow research settings.

Future Research Directions

This synthesis generates several research priorities requiring critical study in the future. Multi-document summarisation systems able to combine long-range information from patient records (e.g., an entire hospitalisation history) such as admission records, serial progress notes, laboratory reports, imaging reports, and discharge planning into consistent narratives would bridge the gap in discharge summary generation identified in this review. Human-in-the-loop frameworks that allow clinicians to interact productively with AI-generated summaries must be systematically developed and subjected to large-scale validation studies to determine safety profiles, user acceptance, efficiency effects, and optimal human-AI interaction patterns [25]. Unified clinical benchmarks with professionally rated review criteria tailored specifically to discharge and clinical hand-off summarisation would allow meaningful progress measurement and systematic comparison of models across research groups.

Multilingual and cross-cultural research extending beyond English and US-centric datasets would greatly improve global applicability and allow summarisation systems to benefit a variety of healthcare systems with varying documentation practices. Resource-efficient models that could be adopted by healthcare organisations with limited computational capabilities would democratise access to summarisation capabilities beyond well-endowed academic medical centres. Outcome-based performance measures of actual effects on clinician workload reduction, communication improvement, error reduction, and patient safety would supplement technical performance measurements with clinically relevant measures of value. Ethical frameworks and regulatory guidelines for clinical AI implementation should be developed proactively alongside technical capability development to facilitate responsible implementation [29]. Multimodal summarisation combining textual clinical notes with numerical laboratory data, structured medication lists, and diagnostic imaging results could eventually offer more detailed and clinically beneficial patient summaries than text-only methods.

CONCLUSION

NLP-based clinical documentation summarisation holds great potential for enhancing healthcare efficiency, quality of communication, and, ultimately, patient safety and outcomes [14]. This scoping review reveals that the area has come a long way from simple extractive statistical techniques to highly engineered transformer-based and large language model (LLM) systems capable of producing coherent and contextually relevant clinical summaries. Nonetheless, critical obstacles stand between successful research performance on the one hand and safe and efficient clinical application on the other. The dataset limitations constraining model development and generalisation, evaluation practices that fail to reflect clinical validity, faithfulness problems with hallucinated content, and the lack of workflow integration needed for practical implementation must all be addressed in an orderly and rigorous way before summarisation systems can responsibly enter clinical practice.

Making clinically deployed summarisation real will require genuinely collaborative effort: NLP researchers to build technical capabilities, clinicians to define requirements and validate outputs, health system engineers to overcome integration issues, and ethicists to ensure the endeavour remains responsible [28]. Human-in-the-loop systems with strong safety guarantees provided by expert oversight, standardised assessment frameworks with clinical validity requirements going beyond lexical similarity, and diverse multilingual datasets enabling worldwide application across healthcare systems are among the priority development areas [20]. With continued interdisciplinary focus on these priorities, combining technical innovation, clinical validation, and ethical governance, NLP summarisation can eventually deliver on its potential to significantly decrease clinician documentation burden, improve care transition communication, support diagnostic assistance, and help transform today's overwhelming clinical data into actionable diagnostic understanding that improves patient care [24].