Curriculum learning for data filtering and domain adaptation of neural machine translation models

Free access

Modern Neural Machine Translation (NMT) systems require large volumes of parallel data for training. However, corpora collected from diverse sources often contain significant noise, such as translation inaccuracies, stylistic mismatches, and semantic errors. The conventional approach of static filtering faces a critical trade-off: overly aggressive filtering discards valuable linguistic diversity and harms model generalization, while lenient filtering lets through artifacts that degrade final translation quality. To address this challenge, this work proposes a dynamic filtering method based on the curriculum learning technique. In this approach, the data selection criteria become progressively stricter as training progresses, enabling the model to first master general patterns before focusing on high-quality examples. We empirically demonstrate the effectiveness of this method. Furthermore, we extend the same methodology to a framework for adapting Large Language Models (LLMs), guiding them from sentence-level translation to the more complex task of document-level translation.
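The abstract gives no implementation details, but the progressive-tightening idea can be illustrated with a minimal sketch. Assume each sentence pair carries a precomputed quality score in [0, 1] (for instance from a quality-estimation model or dual cross-entropy scoring); all names below (quality_threshold, curriculum_filter, the linear schedule) are hypothetical illustrations, not the paper's actual method:

    import random

    def quality_threshold(step, total_steps, start=0.0, end=0.8):
        """Linearly raise the minimum acceptable quality score as training progresses."""
        frac = min(step / max(total_steps, 1), 1.0)
        return start + frac * (end - start)

    def curriculum_filter(corpus, step, total_steps):
        """Keep only sentence pairs whose quality score meets the current threshold.

        `corpus` is a list of (source, target, score) triples, where `score`
        is a precomputed quality estimate in [0, 1].
        """
        tau = quality_threshold(step, total_steps)
        return [(src, tgt) for src, tgt, score in corpus if score >= tau]

    # Toy usage: early steps keep nearly all pairs (general patterns),
    # late steps keep only the highest-quality pairs.
    corpus = [("src%d" % i, "tgt%d" % i, random.random()) for i in range(1000)]
    for step in (0, 5000, 10000):
        pool = curriculum_filter(corpus, step, total_steps=10000)
        print(step, len(pool))

The linear schedule is only one choice; a stepwise (per-epoch) or nonlinear schedule would fit the same framework, as would swapping the quality score for a document-level difficulty measure when adapting LLMs from sentence-level to document-level translation.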


Neural machine translation, curriculum learning, large language models

Short address: https://sciup.org/142245836

IDR: 142245836   |   UDC: 004.852