Agile Intelligent Software Solution for Textual Content Authorship Identification Based on NLP, Artificial Intelligence and Machine Learning
Authors: Zhengbing Hu, Victoria Vysotska, Lyubomyr Chyrun, Roman Romanchuk, Yuriy Ushenko, Dmytro Uhryn, Cennuo Hu
Journal: International Journal of Modern Education and Computer Science @ijmecs
Article in issue: No. 2, Vol. 17, 2025.
Free access
The main goal of this work is to create an intelligent system that uses NLP methods and machine learning algorithms to analyse and classify the authorship of textual content. The following machine learning models were trained and tested on the dataset for English and Ukrainian publications: Support Vector Classifier, Random Forest, Naive Bayes, Logistic Regression and Neural Network. For English, model accuracy was higher because more text data was available. The results for English fiction publications show that the Neural Network classifier outperforms the other models on all evaluated metrics, achieving the highest accuracy (0.97), recall (0.96), F1 score (0.98) and precision (0.96). This indicates that the Neural Network is particularly effective at capturing the distinctive features of the writing styles of different English authors in scientific and technical texts. For Ukrainian, accuracy drops by 5-10% because fewer text corpora are available for training. The results for Ukrainian scientific and technical publications show that the Random Forest classifier outperforms the other models on all evaluated metrics, achieving the highest accuracy (0.88), recall (0.87), F1 score (0.87) and precision (0.87). This indicates that Random Forest is particularly effective at capturing the distinctive features of the writing styles of different Ukrainian authors in scientific and technical texts. The other models were considerably less accurate: Support Vector Classifier (77%), Logistic Regression (73%) and Naive Bayes (70%). The results for Ukrainian fiction publications show that the Random Forest classifier again outperforms the other models on all evaluated metrics, achieving the highest accuracy (0.85), recall (0.84), F1 score (0.84) and precision (0.84). The other models were again considerably less accurate: Support Vector Classifier (77%), Logistic Regression (73%) and Naive Bayes (70%).
Keywords: Author's Style, Machine Learning, Authorship Identification, Stylometry, NLP, Artificial Intelligence, Information Technology
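The abstract compares classifiers by accuracy, precision, recall and F1 score. As a rough illustration of how these four metrics can be computed for a multi-class authorship task, the following is a minimal sketch in plain Python. The function name `classification_metrics`, the author labels `"A"`, `"B"`, `"C"` and the toy predictions are hypothetical, and macro averaging across authors is an assumption: the abstract does not state which averaging scheme the authors used.

```python
from collections import Counter


def classification_metrics(y_true, y_pred):
    """Return accuracy and macro-averaged precision, recall and F1.

    Macro averaging (an unweighted mean over author classes) is an
    assumption here; the source does not name its averaging scheme.
    """
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and equal in length")

    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1   # correctly attributed to this author
        else:
            fp[pred] += 1    # text wrongly attributed to `pred`
            fn[truth] += 1   # text by `truth` that was missed

    precisions, recalls, f1s = [], [], []
    for label in labels:
        p_den = tp[label] + fp[label]
        r_den = tp[label] + fn[label]
        precision = tp[label] / p_den if p_den else 0.0
        recall = tp[label] / r_den if r_den else 0.0
        pr_sum = precision + recall
        f1 = 2 * precision * recall / pr_sum if pr_sum else 0.0
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)

    n = len(labels)
    return {
        "accuracy": sum(tp.values()) / len(y_true),
        "precision": sum(precisions) / n,
        "recall": sum(recalls) / n,
        "f1": sum(f1s) / n,
    }


# Hypothetical toy data: six texts by three authors, one misattribution.
y_true = ["A", "A", "A", "B", "B", "C"]
y_pred = ["A", "A", "B", "B", "B", "C"]
metrics = classification_metrics(y_true, y_pred)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

With macro averaging, the F1 score is the mean of the per-author F1 values, so it need not equal the harmonic mean of the macro-averaged precision and recall.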
Short URL: https://sciup.org/15019753
IDR: 15019753 | DOI: 10.5815/ijmecs.2025.02.02