Speech and text data preparation for developing an automatic speech recognition system for the Karelian language

Free access

This paper addresses the collection and preparation of language data for the Livvi dialect of Karelian, needed to train an automatic speech-to-text system. Such technologies matter for Karelian because it is a low-resource language, which is a serious obstacle to its study and preservation. The main tasks at the current stage of the research are to collect and annotate speech and text corpora and to create a transcription dictionary. The speech corpus includes audio recordings of 15 speakers (6 men and 9 women). All the recordings were transcribed and segmented into single utterances. After the removal of "junk" fragments, the recordings total 3.5 hours. After the removal of repeated sentences, the text corpus contains over 5 million word tokens. A dictionary was created from the collected text corpus and will subsequently be used as part of the Karelian speech recognition system. All the words in the dictionary were transcribed automatically (phonemic transcription). In further research, the collected text and speech data will be used to train and test the Livvi-Karelian speech recognition system.
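The text-corpus steps mentioned above (removing repeated sentences, then deriving a word list for the dictionary) can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the function names, the normalisation rule (lowercasing and collapsing whitespace), and the tokenisation rule (letters plus apostrophes count as word characters) are all assumptions made for illustration, and the sample sentences are toy placeholders, not corpus data.

```python
import re
from collections import Counter

# Assumption: a word is a run of letters that may contain or end with
# apostrophes. The abstract does not describe the real tokenisation rule.
WORD_PATTERN = re.compile(r"[^\W\d_]+(?:'[^\W\d_]*)*")


def deduplicate_sentences(sentences):
    """Drop empty and repeated sentences, keeping the first occurrence."""
    seen = set()
    unique = []
    for sentence in sentences:
        # Hypothetical normalisation: ignore case and extra whitespace.
        key = " ".join(sentence.lower().split())
        if key and key not in seen:
            seen.add(key)
            unique.append(sentence.strip())
    return unique


def build_vocabulary(sentences):
    """Count word tokens; the keys form a candidate dictionary word list."""
    counts = Counter()
    for sentence in sentences:
        counts.update(WORD_PATTERN.findall(sentence.lower()))
    return counts


if __name__ == "__main__":
    # Toy placeholder sentences, not taken from the corpus.
    raw = ["Terveh teile!", "terveh  teile!", "Passibo suuri.", ""]
    unique = deduplicate_sentences(raw)
    vocabulary = build_vocabulary(unique)
    print(len(unique), "unique sentences")
    print(sum(vocabulary.values()), "word tokens")
    print(len(vocabulary), "distinct words")
```

In a sketch like this, the sum of the counts corresponds to the corpus size in word tokens, and the distinct keys are the entries that would then be passed to an automatic transcriber.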


Karelian language, Livvi-Karelian dialect, automatic natural language processing, speech recognition system training, datasets, corpus linguistics

Short URL: https://sciup.org/147241456

IDR: 147241456   |   DOI: 10.15393/uchz.art.2023.924

Research article