Abstract
Sentiment analysis of social media has gained substantial attention due to the rapid growth of multimedia content across digital platforms. Traditional sentiment analysis techniques rely primarily on textual information, which limits their ability to capture the rich emotional cues present in audio signals and visual expressions. Social media posts frequently contain videos that combine speech, facial expressions, and textual captions, and these heterogeneous modalities carry complementary emotional information that conventional unimodal models struggle to interpret. The inability of earlier systems to integrate multimodal information limits both sentiment classification accuracy and contextual understanding. To address this challenge, this study introduces a Fusion Transformer for Multimodal Sentiment Analysis (FTMSA), which integrates audio, visual, and textual modalities into a unified representation framework. The proposed architecture uses transformer-based attention mechanisms to capture inter-modal relationships among speech tone, facial features, and textual semantics. A feature extraction module produces textual embeddings through contextual language representation, acoustic descriptors that characterize speech, and visual encodings that capture facial emotional cues. These heterogeneous features are fused through a cross-modal attention transformer that learns correlations among the modalities, and the model is trained with supervised learning to optimize sentiment classification across multimodal inputs. Experimental evaluation demonstrates that FTMSA achieves a maximum accuracy of 93.2%, precision of 92.3%, recall of 91.3%, F1 score of 91.8%, and specificity of 92.7%, outperforming existing methods such as MAN, RMNN, and TBMM as well as conventional unimodal and early fusion baselines. The model maintains superior performance across varying training epochs and dataset sizes, validating the effectiveness of the cross-modal attention mechanism in capturing textual, acoustic, and visual sentiment cues for accurate prediction.
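To make the fusion step concrete, the following is a minimal sketch, not the authors' implementation, of cross-modal attention over pre-extracted text, audio, and visual features. It assumes PyTorch, illustrative feature dimensions and sequence lengths, and a three-class sentiment head; the class name CrossModalFusion and all hyperparameters are hypothetical.

```python
# Minimal sketch (assumed PyTorch, illustrative dimensions): text queries
# attend over acoustic and visual sequences, and a transformer encoder
# processes the fused representation before sentiment classification.
import torch
import torch.nn as nn


class CrossModalFusion(nn.Module):
    def __init__(self, text_dim=768, audio_dim=74, visual_dim=35,
                 d_model=256, n_heads=4, n_classes=3):
        super().__init__()
        # Project each modality into a shared d_model space.
        self.text_proj = nn.Linear(text_dim, d_model)
        self.audio_proj = nn.Linear(audio_dim, d_model)
        self.visual_proj = nn.Linear(visual_dim, d_model)
        # Cross-modal attention: text tokens query audio and visual cues.
        self.text_audio_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.text_visual_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # Transformer encoder over the fused sequence, then a sentiment head.
        encoder_layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=2)
        self.classifier = nn.Linear(d_model, n_classes)

    def forward(self, text, audio, visual):
        t = self.text_proj(text)      # (B, Lt, d_model)
        a = self.audio_proj(audio)    # (B, La, d_model)
        v = self.visual_proj(visual)  # (B, Lv, d_model)
        ta, _ = self.text_audio_attn(t, a, a)   # text attends to audio
        tv, _ = self.text_visual_attn(t, v, v)  # text attends to visuals
        fused = self.encoder(t + ta + tv)          # (B, Lt, d_model)
        return self.classifier(fused.mean(dim=1))  # (B, n_classes)


if __name__ == "__main__":
    model = CrossModalFusion()
    logits = model(torch.randn(2, 20, 768),   # textual embeddings
                   torch.randn(2, 50, 74),    # acoustic descriptors
                   torch.randn(2, 30, 35))    # facial/visual features
    print(logits.shape)  # torch.Size([2, 3])
```

Training such a model with a standard cross-entropy loss over labeled multimodal clips corresponds to the supervised optimization described above.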
Authors
K.S. Suresh1, S. Vamshi Krushna2
Rajeswari Vedachalam Government Arts College, India1, Vignana Bharathi Institute of Technology, India2
Keywords
Multimodal Sentiment Analysis, Fusion Transformer, Audio-Visual-Textual Features, Social Media Analytics, Cross-Modal Attention