Application of the Vectorization Method in Natural Language Processing Problems Using Machine Learning

Free access

The article presents a study on the practical application of vectorization methods in natural language processing tasks using machine learning. The main purpose of the work is to implement vector text representations and evaluate their effectiveness for classifying mathematical texts by subject area. Three vectorization methods are considered: one-hot encoding, Bag of Words, and TF-IDF, implemented in Python with the Pandas, NumPy, Matplotlib, and scikit-learn libraries. A pipeline comprising text preprocessing, vectorization, and decision-tree classification has been developed. Analysis of the results showed that TF-IDF takes into account rare but semantically significant terms, which contributes to a better separation of documents by mathematical area. The scientific novelty lies in adapting classical vectorization methods to a highly specialized corpus of mathematical texts, an area that has received little attention in prior research. The article demonstrates the practical value of integrating natural language processing methods with machine learning for domain-oriented categorization of scientific texts and opens up prospects for further research in this area.
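
The pipeline described in the abstract might be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the toy corpus, the labels, and the variable names are hypothetical, and one-hot encoding is assumed here to mean a binary term-presence vector per document (`CountVectorizer(binary=True)`). Preprocessing is limited to the vectorizers' default lowercasing and tokenization.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeClassifier

# Hypothetical toy corpus standing in for the mathematical texts (not the article's data).
docs = [
    "the derivative of a polynomial function is a polynomial",
    "the integral of a continuous function over an interval",
    "the limit of a sequence of real numbers",
    "the eigenvalues of a square matrix",
    "a vector space has a basis of linearly independent vectors",
    "the determinant of a matrix product",
]
labels = [
    "calculus", "calculus", "calculus",
    "linear algebra", "linear algebra", "linear algebra",
]

# The three representations named in the abstract.
vectorizers = {
    "one-hot": CountVectorizer(binary=True),  # assumption: term presence/absence
    "bag of words": CountVectorizer(),        # raw term counts
    "tf-idf": TfidfVectorizer(),              # counts reweighted by inverse document frequency
}

results = {}
for name, vectorizer in vectorizers.items():
    # Vectorization followed by a decision-tree classifier, as one pipeline.
    model = make_pipeline(vectorizer, DecisionTreeClassifier(random_state=0))
    model.fit(docs, labels)
    # Accuracy on the training texts: a sanity check only, not an evaluation.
    results[name] = model.score(docs, labels)
    print(f"{name}: training accuracy = {results[name]:.2f}")
```

On a real corpus, the comparison between representations would be made on held-out documents (for example with `train_test_split` or cross-validation) rather than on the training texts.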


Machine learning, vectorization method, clustering, natural language processing, Python

Short URL: https://sciup.org/140313568

IDR: 140313568   |   UDC: 004.89   |   DOI: 10.18469/ikt.2025.23.2.07