Abstract
Sentiment analysis of social media has gained substantial attention due to the rapid growth of multimedia content across digital platforms. Traditional sentiment analysis techniques rely primarily on textual information, which limits their ability to capture the rich emotional cues present in audio signals and visual expressions. Social media posts frequently contain videos that combine speech, facial expressions, and textual captions, and these heterogeneous modalities carry complementary emotional information that conventional unimodal models struggle to interpret. The inability of earlier systems to integrate multimodal information limits both sentiment classification accuracy and contextual understanding. To address this challenge, this study introduces a Fusion Transformer for Multimodal Sentiment Analysis (FTMSA), which integrates audio, visual, and textual modalities into a unified representation framework. The proposed architecture uses transformer-based attention mechanisms to capture inter-modal relationships among speech tone, facial features, and textual semantics. A feature extraction module produces textual embeddings through contextual language representation, acoustic descriptors that characterize speech, and visual encodings that capture facial emotional cues. These heterogeneous features are fused through a cross-modal attention transformer that learns correlations among the modalities, and the model is trained with supervised learning to optimize sentiment classification across multimodal inputs. Experimental evaluation demonstrates that FTMSA achieves a maximum accuracy of 93.2%, precision of 92.3%, recall of 91.3%, F1 score of 91.8%, and specificity of 92.7%, outperforming existing methods such as MAN, RMNN, and TBMM as well as conventional unimodal and early fusion baselines. The model maintains superior performance across varying training epochs and dataset sizes, validating the effectiveness of the cross-modal attention mechanism in capturing textual, acoustic, and visual sentiment cues for accurate prediction.
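To make the fusion step concrete, the following is a minimal sketch, not the authors' implementation, of cross-modal attention over pre-extracted text, audio, and visual features. It assumes PyTorch, illustrative feature dimensions and sequence lengths, and a three-class sentiment head; the class name CrossModalFusion and all hyperparameters are hypothetical.

```python
# Minimal sketch (assumed PyTorch, illustrative dimensions): text queries
# attend over acoustic and visual sequences, and a transformer encoder
# processes the fused representation before sentiment classification.
import torch
import torch.nn as nn


class CrossModalFusion(nn.Module):
    def __init__(self, text_dim=768, audio_dim=74, visual_dim=35,
                 d_model=256, n_heads=4, n_classes=3):
        super().__init__()
        # Project each modality into a shared d_model space.
        self.text_proj = nn.Linear(text_dim, d_model)
        self.audio_proj = nn.Linear(audio_dim, d_model)
        self.visual_proj = nn.Linear(visual_dim, d_model)
        # Cross-modal attention: text tokens query audio and visual cues.
        self.text_audio_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.text_visual_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # Transformer encoder over the fused sequence, then a sentiment head.
        encoder_layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=2)
        self.classifier = nn.Linear(d_model, n_classes)

    def forward(self, text, audio, visual):
        t = self.text_proj(text)      # (B, Lt, d_model)
        a = self.audio_proj(audio)    # (B, La, d_model)
        v = self.visual_proj(visual)  # (B, Lv, d_model)
        ta, _ = self.text_audio_attn(t, a, a)   # text attends to audio
        tv, _ = self.text_visual_attn(t, v, v)  # text attends to visuals
        fused = self.encoder(t + ta + tv)          # (B, Lt, d_model)
        return self.classifier(fused.mean(dim=1))  # (B, n_classes)


if __name__ == "__main__":
    model = CrossModalFusion()
    logits = model(torch.randn(2, 20, 768),   # textual embeddings
                   torch.randn(2, 50, 74),    # acoustic descriptors
                   torch.randn(2, 30, 35))    # facial/visual features
    print(logits.shape)  # torch.Size([2, 3])
```

Training such a model with a standard cross-entropy loss over labeled multimodal clips corresponds to the supervised optimization described above.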
Authors
K.S. Suresh1, S. Vamshi Krushna2
Rajeswari Vedachalam Government Arts College, India1, Vignana Bharathi Institute of Technology, India2
Keywords
Multimodal Sentiment Analysis, Fusion Transformer, Audio-Visual-Textual Features, Social Media Analytics, Cross-Modal Attention