A Multi-Modal Transformer Model with Gated-LSTM for Sarcasm Detection in Tweets Using Cross-Attention and Emoji Integration

Shaikh Ambreen Mohd Ibrahim; Manoj M. Deshpande; Vijaykumar N. Pawar

doi:10.5815/ijieeb.2026.03.02

Journal: International Journal of Information Engineering and Electronic Business

Published in: Vol. 18, No. 3, 2026

Free access

In social media communication, sarcasm poses a major challenge for automated sentiment analysis systems, particularly on platforms such as Twitter, where text is brief and often contextually ambiguous. Misinterpreting sarcastic content can degrade the reliability of downstream analytics, including opinion mining and content moderation. To address this challenge, we propose a multi-modal transformer-based approach to sarcasm detection that integrates textual and emoji information through a cross-attention mechanism. The proposed model uses RoBERTa to generate contextualized text embeddings, while emojis are encoded with Emoji-BERT to capture emoji-specific semantic and emotional cues. A Gated-LSTM layer models sequential dependencies among emojis, and a cross-attention mechanism dynamically aligns emoji representations with textual features to enhance sarcasm recognition. The fused representations are then passed to a fully connected classification layer that predicts sarcasm. The model is evaluated against state-of-the-art results using standard evaluation metrics. Experimental results show that the proposed approach outperforms several baseline and state-of-the-art models, achieving an accuracy of 92.5%, a precision of 91.8%, a recall of 93.2%, and an F1-score of 92.5%. These results indicate that jointly modeling textual and emoji modalities improves sarcasm detection in social media content, and they illustrate the potential of the approach for sarcasm-aware sentiment analysis in social media analytics and automated content moderation systems.
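The cross-attention step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes that text token vectors act as queries and emoji vectors as keys and values, it omits the learned projection matrices, and it replaces the RoBERTa, Emoji-BERT, and Gated-LSTM outputs with tiny hand-written vectors. The function names `softmax`, `cross_attention`, and `fuse`, and the choice of concatenation as the fusion step, are hypothetical and chosen only for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(text_vecs, emoji_vecs):
    """Scaled dot-product cross-attention without learned projections.

    Assumption: text vectors are the queries; emoji vectors serve as both
    keys and values. Returns (attended, weights), one row per text token.
    """
    dim = len(emoji_vecs[0])
    scale = math.sqrt(dim)
    attended, weights = [], []
    for query in text_vecs:
        # Similarity of this text token to every emoji.
        scores = [
            sum(q * k for q, k in zip(query, key)) / scale
            for key in emoji_vecs
        ]
        row = softmax(scores)
        # Weighted average of the emoji vectors.
        context = [
            sum(row[j] * emoji_vecs[j][d] for j in range(len(emoji_vecs)))
            for d in range(dim)
        ]
        attended.append(context)
        weights.append(row)
    return attended, weights


def fuse(text_vecs, attended):
    """Concatenate each text vector with its emoji context vector."""
    return [t + c for t, c in zip(text_vecs, attended)]


# Toy stand-ins for encoder outputs (hypothetical values, not real embeddings).
text_vecs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # 3 text tokens, dimension 2
emoji_vecs = [[1.0, 0.0], [0.0, 1.0]]             # 2 emojis, dimension 2

attended, weights = cross_attention(text_vecs, emoji_vecs)
fused = fuse(text_vecs, attended)
```

In this sketch each text token receives a context vector that is a weighted average of the emoji vectors, so a token that resembles one emoji more closely draws more of its representation from that emoji. In a trained model the fused vectors would then be pooled and passed to the classification layer; that step is left out here.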

Keywords: Sarcasm Detection, Multi-Modal Learning, Cross-Attention, Sentiment Analysis, Emoji Integration, Transformer Models

Short URL: https://sciup.org/15020378

IDR: 15020378