KanAVNet: A CNN-BiLSTM-CTC-Based Audio-Visual Speech Recognition System for Kannada to Assist the Hearing Impaired
Authors: Divya, Suresha D.
Journal: International Journal of Image, Graphics and Signal Processing (IJIGSP)
Issue: Vol. 18, No. 2, 2026.
Free access
This research outlines a comprehensive dual-modality speech recognition system designed to help hearing-impaired students understand spoken Kannada through synchronized processing of auditory signals and visual articulatory cues. The approach leverages deep learning to extract speech-related features from spectrograms and Mel-Frequency Cepstral Coefficients (MFCCs) on the audio side, and discriminative lip-movement features via CNNs and Temporal Convolutional Networks (TCNs) on the visual side. A hybrid architecture, KanAVNet (Kannada Audio-Visual Network), built on a CNN–BiLSTM framework, is integrated with a Connectionist Temporal Classification (CTC) loss function to enable robust sequence-to-sequence mapping while addressing temporal alignment challenges in audio-visual speech recognition. The system is trained on a custom-developed Kannada audio-visual dataset, addressing the scarcity of regional-language AVSR resources. Empirical results show that the model achieves an accuracy of 93.2%, a Word Error Rate (WER) of 9.8%, and an F1 score of 91.2%, outperforming baseline unimodal and existing multimodal models. This research highlights the effectiveness of multimodal fusion strategies in noisy environments and showcases the potential of AI-driven tools in promoting accessible and inclusive education for students with auditory impairments.
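The abstract does not give implementation details for the CTC loss that KanAVNet uses for sequence-to-sequence mapping, so the following is only an illustrative sketch: a minimal NumPy implementation of the standard CTC forward algorithm, which sums the probabilities of all frame-level alignments that collapse to a target label sequence. The function name and interface are hypothetical, not from the paper.

```python
import numpy as np

def ctc_neg_log_likelihood(log_probs, target, blank=0):
    """Illustrative CTC loss (negative log-likelihood) via the forward
    algorithm. Not the paper's code.

    log_probs: (T, C) per-frame log class probabilities (class 0 = blank).
    target: list of label ids without blanks.
    """
    T, C = log_probs.shape
    # Extend the target with blanks: [y1, y2] -> [blank, y1, blank, y2, blank]
    ext = [blank]
    for y in target:
        ext += [y, blank]
    S = len(ext)

    # alpha[s]: log-probability of all alignments of the first frames
    # that end at extended-label position s.
    alpha = np.full(S, -np.inf)
    alpha[0] = log_probs[0, ext[0]]
    if S > 1:
        alpha[1] = log_probs[0, ext[1]]

    for t in range(1, T):
        new = np.full(S, -np.inf)
        for s in range(S):
            terms = [alpha[s]]          # stay on the same symbol
            if s > 0:
                terms.append(alpha[s - 1])  # advance by one
            # Skip over a blank, allowed only between distinct labels
            if s > 1 and ext[s] != blank and ext[s] != ext[s - 2]:
                terms.append(alpha[s - 2])
            new[s] = np.logaddexp.reduce(terms) + log_probs[t, ext[s]]
        alpha = new

    # Valid alignments end on the last label or the trailing blank.
    total = np.logaddexp(alpha[S - 1], alpha[S - 2]) if S > 1 else alpha[S - 1]
    return -total
```

For example, with two frames, two classes, uniform probabilities, and target `[1]`, the alignments (blank, 1), (1, blank), and (1, 1) all collapse to `[1]`, so the likelihood is 3 × 0.25 = 0.75 and the loss is −log 0.75. In practice a framework implementation such as a built-in CTC loss would be used; this sketch only shows the underlying dynamic program.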
Keywords: Assistive Technologies, BiLSTM, Connectionist Temporal Classification, Dual-Modality Speech Recognition System, KanAVNet (Kannada Audio-Visual Network), Mel-Frequency Cepstral Coefficients (MFCC)
Short URL: https://sciup.org/15020307
IDR: 15020307 | DOI: 10.5815/ijigsp.2026.02.06