A Robust Hybrid Deep Learning Model for Multiclass Depression Classification from Speech Audio
Authors: Neny Sulistianingsih, Galih Hendro Martono
Journal: International Journal of Image, Graphics and Signal Processing @ijigsp
Article in issue: Vol. 18, No. 2, 2026.
Free access
Depression remains one of the most prevalent and underdiagnosed mental health disorders worldwide, which creates a need for scalable, objective, and non-invasive diagnostic tools. Speech is a rich biomarker of emotional and psychological state and offers a promising avenue for automated depression detection. This study proposes a robust hybrid deep learning framework that integrates Convolutional Neural Network (CNN), Gated Recurrent Unit (GRU), Bidirectional Long Short-Term Memory (BiLSTM), and Transformer architectures to classify depression severity into three levels: normal, mild, and severe. From a curated multimodal dataset of 400 labeled audio recordings, we extract a comprehensive set of acoustic features: MFCC, Chroma, Spectrogram, Contrast, and Tonnetz representations. Models are evaluated on precision, recall, F1-score, and accuracy. Experimental results show that the proposed hybrid models outperform traditional architectures, achieving up to 99% accuracy and strong generalization across all three classes. The study demonstrates the potential of attention-enhanced hybrid architectures for mental health assessment and provides a foundation for future deployment in clinical and real-world settings. Future work includes multimodal fusion with EEG data and explainable AI for clinical interpretability.
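To make the evaluation criteria named in the abstract concrete, the following is a minimal sketch of how per-class precision, recall, and F1-score, together with overall accuracy, could be computed for the three severity levels. It is not the authors' evaluation code: the function name `per_class_metrics`, the label strings, and the toy predictions are assumptions made purely for illustration.

```python
from collections import Counter

# Assumed label names, following the three severity levels in the abstract.
CLASSES = ["normal", "mild", "severe"]


def per_class_metrics(y_true, y_pred, classes=CLASSES):
    """Return ({class: (precision, recall, f1)}, accuracy) for label lists."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")

    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1   # correct prediction for this class
        else:
            fp[pred] += 1    # predicted class received a false positive
            fn[truth] += 1   # true class received a false negative

    report = {}
    for c in classes:
        predicted = tp[c] + fp[c]
        actual = tp[c] + fn[c]
        precision = tp[c] / predicted if predicted else 0.0
        recall = tp[c] / actual if actual else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        report[c] = (precision, recall, f1)

    accuracy = sum(tp.values()) / len(y_true) if y_true else 0.0
    return report, accuracy


# Toy labels invented for illustration only; they are not results from the study.
y_true = ["normal", "normal", "mild", "mild", "severe", "severe"]
y_pred = ["normal", "mild", "mild", "mild", "severe", "normal"]

report, accuracy = per_class_metrics(y_true, y_pred)
for name, (p, r, f1) in report.items():
    print(f"{name:>6}: precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
print(f"accuracy={accuracy:.2f}")
```

Reporting the metrics per class, rather than accuracy alone, shows whether a model handles each of the normal, mild, and severe classes well or mainly benefits from one dominant class.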
Keywords: Depression Detection, Speech Emotion Recognition, Hybrid Deep Learning, CNN, Transformer, GRU, BiLSTM, Mental Health Assessment
Short address: https://sciup.org/15020309
IDR: 15020309 | DOI: 10.5815/ijigsp.2026.02.08