Development of a natural language processing tool for solving the application problem of extracting statistical data from text
Authors: Zakharova O.I., Bednyak S.G.
Journal: Инфокоммуникационные технологии (Infocommunication Technologies) @ikt-psuti
Section: New information technologies
Published in issue: No. 1 (85), vol. 22, 2024.
Free access
Text analytics is used to explore textual content and derive new variables from raw text, which can serve as input data for forecasting models or other statistical methods, including those addressing fundamental problems. The purpose of the research is to analyze machine learning algorithms and practical developments in this field, and to develop an integrated software tool for text processing whose algorithm is built on the BasicStats, ReadabilityStats and SovChLit libraries and which extracts statistics from large volumes of raw Russian-language text. A method for extracting statistical data from large volumes of raw text, based on machine learning and natural language processing in Python, has been implemented and can be embedded into other projects. A software tool was developed that uses the functionality of the textacy library adapted for the Russian language and works with both plain texts and Doc objects generated with the spaCy library. The study used real text data collected from the Samara region information and news portal «63.ru» (as part of the conceptual project «Data Farm» of the artificial intelligence research laboratory). The developed software analyzes large volumes of text data and extracts useful information from them. It can be integrated into other software solutions as one of the linking modules in the code optimization chain for text data processing programs.
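To make the idea of extracting statistics from raw text concrete, here is a minimal sketch in Python. It is not the authors' tool and does not use textacy or spaCy: it relies only on the standard library, and the function name `basic_text_stats`, the regular expressions, and the sample sentence are all assumptions made for illustration. Sentence splitting is deliberately naive (it breaks on `.`, `!`, `?` and `…`), so abbreviations would be miscounted in real news text.

```python
import re
from collections import Counter

# Hypothetical helpers for illustration only; not part of any library's API.
# A "word" is a run of Unicode letters (Cyrillic included), optionally hyphenated.
WORD_RE = re.compile(r"[^\W\d_]+(?:-[^\W\d_]+)*")
# Naive sentence boundary: one or more terminal punctuation marks.
SENT_END_RE = re.compile(r"[.!?…]+")


def basic_text_stats(text: str) -> dict:
    """Return simple surface statistics for a raw text string."""
    words = [w.lower() for w in WORD_RE.findall(text)]
    # Keep only fragments that contain at least one word.
    sentences = [s for s in SENT_END_RE.split(text) if WORD_RE.search(s)]
    n_words = len(words)
    n_sents = len(sentences)
    n_chars = sum(len(w) for w in words)  # letters and inner hyphens only
    return {
        "n_sents": n_sents,
        "n_words": n_words,
        "n_unique_words": len(set(words)),
        "n_chars": n_chars,
        "avg_word_len": n_chars / n_words if n_words else 0.0,
        "avg_sent_len": n_words / n_sents if n_sents else 0.0,
        "top_words": Counter(words).most_common(3),
    }


# Invented sample text, not taken from the «63.ru» corpus.
sample = "Город строит дома. Город встречает туристов!"
stats = basic_text_stats(sample)
print(stats)
```

Because the function takes a plain string and returns a dictionary, a sketch like this could be called from a larger pipeline as one step among several, which is the kind of embedding the abstract describes. A production version would more plausibly delegate tokenization and sentence segmentation to a Russian-language spaCy pipeline.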
Natural language processing, natural language processing algorithm, text processing, statistical extraction, machine learning, Python
Short URL: https://sciup.org/140307958
IDR: 140307958 | DOI: 10.18469/ikt.2024.22.1.13