KanAVNet: A CNN-BiLSTM-CTC-Based Audio-Visual Speech Recognition System for Kannada to Assist the Hearing Impaired
Authors: Divya, Suresha D.
Journal: International Journal of Image, Graphics and Signal Processing (IJIGSP)
Issue: Vol. 18, No. 2, 2026.
Free access
This research outlines a comprehensive dual-modality speech recognition system designed to help hearing-impaired students understand spoken Kannada through synchronized processing of auditory signals and visual articulatory cues. The approach leverages deep learning to extract speech-related features from spectrograms and Mel-Frequency Cepstral Coefficients (MFCCs) on the audio side, and discriminative lip-movement features via CNNs and Temporal Convolutional Networks (TCNs) on the visual side. A hybrid architecture, KanAVNet (Kannada Audio-Visual Network), built on a CNN–BiLSTM framework, is integrated with a Connectionist Temporal Classification (CTC) loss function to enable robust sequence-to-sequence mapping while addressing temporal alignment challenges in audio-visual speech recognition. The system is trained on a custom-developed Kannada audio-visual dataset, addressing the scarcity of regional-language AVSR resources. Empirical results show that the model achieves an accuracy of 93.2%, a Word Error Rate (WER) of 9.8%, and an F1 score of 91.2%, outperforming baseline unimodal and existing multimodal models. This research highlights the effectiveness of multimodal fusion strategies in noisy environments and showcases the potential of AI-driven tools in promoting accessible and inclusive education for students with auditory impairments.
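The abstract does not give implementation details for the CTC loss that KanAVNet uses for sequence-to-sequence mapping, so the following is only an illustrative sketch: a minimal NumPy implementation of the standard CTC forward algorithm, which sums the probabilities of all frame-level alignments that collapse to a target label sequence. The function name and interface are hypothetical, not from the paper.

```python
import numpy as np

def ctc_neg_log_likelihood(log_probs, target, blank=0):
    """Illustrative CTC loss (negative log-likelihood) via the forward
    algorithm. Not the paper's code.

    log_probs: (T, C) per-frame log class probabilities (class 0 = blank).
    target: list of label ids without blanks.
    """
    T, C = log_probs.shape
    # Extend the target with blanks: [y1, y2] -> [blank, y1, blank, y2, blank]
    ext = [blank]
    for y in target:
        ext += [y, blank]
    S = len(ext)

    # alpha[s]: log-probability of all alignments of the first frames
    # that end at extended-label position s.
    alpha = np.full(S, -np.inf)
    alpha[0] = log_probs[0, ext[0]]
    if S > 1:
        alpha[1] = log_probs[0, ext[1]]

    for t in range(1, T):
        new = np.full(S, -np.inf)
        for s in range(S):
            terms = [alpha[s]]          # stay on the same symbol
            if s > 0:
                terms.append(alpha[s - 1])  # advance by one
            # Skip over a blank, allowed only between distinct labels
            if s > 1 and ext[s] != blank and ext[s] != ext[s - 2]:
                terms.append(alpha[s - 2])
            new[s] = np.logaddexp.reduce(terms) + log_probs[t, ext[s]]
        alpha = new

    # Valid alignments end on the last label or the trailing blank.
    total = np.logaddexp(alpha[S - 1], alpha[S - 2]) if S > 1 else alpha[S - 1]
    return -total
```

For example, with two frames, two classes, uniform probabilities, and target `[1]`, the alignments (blank, 1), (1, blank), and (1, 1) all collapse to `[1]`, so the likelihood is 3 × 0.25 = 0.75 and the loss is −log 0.75. In practice a framework implementation such as a built-in CTC loss would be used; this sketch only shows the underlying dynamic program.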
Keywords: Assistive Technologies, BiLSTM, Connectionist Temporal Classification, Dual-Modality Speech Recognition System, KanAVNet (Kannada Audio-Visual Network), Mel-Frequency Cepstral Coefficients (MFCC)
Short URL: https://sciup.org/15020307
IDR: 15020307 | DOI: 10.5815/ijigsp.2026.02.06