Weighted Late Fusion based Deep Attention Neural Network for Detecting Multi-Modal Emotion
Authors: Srinivas P.V.V.S., Shaik Nazeera Khamar, Nohith Borusu, Mohan Guru Raghavendra Kota, Harika Vuyyuru, Sampath Patchigolla
Journal: International Journal of Image, Graphics and Signal Processing (IJIGSP)
Issue: No. 1, Vol. 18, 2026
Free access
In affective computing research, multi-modal emotion detection has gained popularity as a way to boost recognition robustness and overcome the constraints of single-modality analysis. Human emotions are captured through a variety of methodologies, including physiological indicators, facial expressions, and neuroimaging techniques. Here, a novel deep attention mechanism is used to detect multi-modal emotions. Initially, data are collected from the audio and video modalities. For dimensionality reduction, the audio features are extracted using the Constant-Q chromagram and Mel-Frequency Cepstral Coefficients (MM-FC2). After extraction, audio feature generation is carried out by a Convolutional Dense Capsule Network (Conv_DCN). For the video data, key frame extraction is carried out using enhanced spatial-temporal and Second-Order Gaussian kernels; Second-Order Gaussian kernels are a powerful tool for extracting features from video data and converting them into a format suitable for image-based analysis. For video feature generation, DenseNet-169 is used. Finally, all the extracted features are fused and emotions are detected using a Weighted Late Fusion Deep Attention Neural Network (WLF_DAttNN). The approach is implemented in Python and achieves an accuracy of 97% on the RAVDESS dataset and 96% on the CREMA-D dataset.
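The abstract names concrete building blocks (constant-Q chromagram, MFCCs, weighted late fusion), so a brief sketch may help make the pipeline concrete. The snippet below is a minimal illustration, not the authors' implementation: it assumes librosa for the two audio descriptors, and the fusion weights (0.4/0.6) and the 8-class setup are illustrative placeholders rather than values from the paper.

```python
# Minimal sketch (not the authors' implementation): extract the two audio
# descriptors named in the abstract -- constant-Q chromagram and MFCCs -- with
# librosa, then combine audio/video class scores by a simple weighted late
# fusion. Fusion weights and feature sizes here are illustrative assumptions.
import numpy as np
import librosa


def extract_audio_features(path, n_mfcc=13):
    """Return a pooled constant-Q chroma + MFCC vector for one audio clip."""
    y, sr = librosa.load(path, sr=None)                      # keep native sample rate
    chroma = librosa.feature.chroma_cqt(y=y, sr=sr)          # (12, frames)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)   # (n_mfcc, frames)
    # Average over time so clips of different lengths yield fixed-size vectors.
    return np.concatenate([chroma.mean(axis=1), mfcc.mean(axis=1)])


def weighted_late_fusion(p_audio, p_video, w_audio=0.4, w_video=0.6):
    """Fuse per-modality class probabilities with fixed (hypothetical) weights."""
    fused = w_audio * np.asarray(p_audio) + w_video * np.asarray(p_video)
    return fused / fused.sum()                               # renormalise to a distribution


if __name__ == "__main__":
    # Dummy probabilities over 8 emotion classes (RAVDESS uses 8 categories).
    p_a = np.full(8, 1 / 8)                                  # uncertain audio branch
    p_v = np.eye(8)[3]                                       # video branch confident in class 3
    print(weighted_late_fusion(p_a, p_v).argmax())           # -> 3
```

Time-averaging the frame-level features is only one simple way to obtain fixed-length vectors; in the paper, the Conv_DCN and DenseNet-169 branches with the WLF_DAttNN fusion head take the place of this pooling and weighting step.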
Keywords: Convolutional Neural Network (CNN), Recurrent Neural Network (RNN), 3D-Convolutional Neural Network (3D-CNN), Mel-Frequency Cepstral Coefficients (MFCCs)
Short address: https://sciup.org/15020142
IDR: 15020142 | DOI: 10.5815/ijigsp.2026.01.07