Extraction of knowledge and relevant linguistic means with efficiency estimation for the formation of subject-oriented text sets
Authors: Mikhaylov Dmitry Vladimirovich, Kozlov Alexander Pavlovich, Emelyanov Gennady Martinovich
Journal: Компьютерная оптика (Computer Optics) @computer-optics
Section: Data analysis
Published in: Issue 4, Vol. 40, 2016.
Free access
This paper addresses two interrelated problems: extracting knowledge units from a set of subject-oriented texts (a corpus), and selecting texts for the corpus by analyzing their relevance to an initial phrase. The main practical goal is to find the most rational way to express a knowledge fragment in a given natural language, so that it can then be reflected in the thesaurus and ontology of a subject area. These problems are important when constructing systems for processing, analyzing, estimating and understanding information. The relevance of a text to the initial phrase, in terms of the described fragment of actual knowledge (including the forms of its expression in a given natural language), is defined by the total numerical estimate of the coupling strength of words from the initial phrase that occur jointly in phrases of the text under analysis. The paper considers known variants of such estimation procedures and their application to the search for distinct components that reflect the initial phrase in the texts selected for the topical corpus. These components correspond to words and word combinations. Compared with searching for such components in a syntactically annotated text corpus, the proposed text selection method reduces the output of phrases that are irrelevant to the initial one, in terms of either the described knowledge fragment or the forms of its expression in a given natural language, by a factor of 15 on average.
Pattern recognition, intelligent data analysis, information theory, open-form test assignment, natural-language expression of expert knowledge, contextual annotation, document ranking in information retrieval
Short URL: https://sciup.org/14059596
IDR: 14059596 | DOI: 10.18287/2412-6179-2016-40-4-572-582
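The relevance estimate described in the abstract (a total of coupling strengths for pairs of initial-phrase words that co-occur in phrases of a text) can be illustrated with a minimal sketch. The abstract does not say which coupling measure the authors use, so the Dice coefficient below is an assumption chosen only for illustration, as are the function names `split_phrases`, `tokenize`, `dice` and `relevance`, the punctuation-based phrase splitting, and the absence of any morphological normalization or stop-word filtering.

```python
import re
from itertools import combinations


def split_phrases(text):
    """Split a text into phrases on sentence-final punctuation (a crude assumption)."""
    return [p for p in re.split(r"[.!?]+", text) if p.strip()]


def tokenize(phrase):
    """Return the set of lowercased word tokens in a phrase (no lemmatization)."""
    return set(re.findall(r"\w+", phrase.lower()))


def dice(n_a, n_b, n_ab):
    """Dice coefficient, used here as a stand-in coupling-strength measure."""
    return 2.0 * n_ab / (n_a + n_b) if (n_a + n_b) else 0.0


def relevance(initial_phrase, text):
    """Sum the coupling strength over all pairs of words from the initial phrase.

    Counts are numbers of phrases of `text` containing a word (n_a, n_b)
    or both words of a pair (n_ab).
    """
    query = sorted(tokenize(initial_phrase))
    phrases = [tokenize(p) for p in split_phrases(text)]
    total = 0.0
    for a, b in combinations(query, 2):
        n_a = sum(a in p for p in phrases)
        n_b = sum(b in p for p in phrases)
        n_ab = sum(a in p and b in p for p in phrases)
        total += dice(n_a, n_b, n_ab)
    return total
```

Under these assumptions, candidate texts could be ranked by `relevance(initial_phrase, text)` and only the highest-scoring ones kept for the corpus. A text in which the words of the initial phrase never share a phrase scores zero, whatever the individual word frequencies are.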