Abstract
The exponential growth of multimedia content in cloud environments
has created the need for advanced, real-time processing techniques.
Traditional deep learning models, while powerful, often face
bottlenecks in handling high-dimensional streaming data efficiently.
Conventional vision transformer architectures exhibit computational
overhead and delayed decision-making when processing large-scale
multimedia streams in distributed cloud systems, impacting latency and
accuracy. This study proposes an improved swarm decision
mechanism integrated into vision transformers (VT-SwarmNet) for
efficient large-scale multimedia stream analysis. The approach
combines swarm intelligence for dynamic token selection with
transformer-based feature encoding. Data streams are pre-processed in
the cloud using distributed computing, partitioned into manageable
chunks, and processed in parallel. Swarm agents prioritize salient
tokens, improving attention allocation and reducing redundant
computations. Experiments conducted on a large-scale multimedia
dataset in a simulated cloud environment demonstrated that
VT-SwarmNet achieved 12.4% higher accuracy, 18.7% lower latency, and
a 15.3% higher F1-score than leading baseline methods. The
integration of swarm-based decision-making reduced processing
overhead while maintaining superior feature extraction.
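
As a rough illustration of the pipeline described above (chunking a stream, processing chunks in parallel, and letting swarm agents prioritize salient tokens), the following is a minimal sketch and not the authors' implementation. Every name in it (`saliency`, `swarm_select`, `chunked`, `process_stream`) is hypothetical. The L2-norm saliency score, the voting rule, and the use of a thread pool in place of a distributed cloud back end are all assumptions made for the example.

```python
import math
import random
from concurrent.futures import ThreadPoolExecutor


def saliency(token):
    # Assumption: the L2 norm stands in for a learned saliency score.
    return math.sqrt(sum(x * x for x in token))


def swarm_select(tokens, keep, n_agents=8, sample_frac=0.5, seed=0):
    """Return sorted indices of the `keep` tokens with the most agent votes."""
    rng = random.Random(seed)
    votes = [0] * len(tokens)
    sample_size = min(len(tokens), max(keep, int(len(tokens) * sample_frac)))
    for _ in range(n_agents):
        # Each hypothetical agent inspects a random subset of tokens
        # and votes for the most salient ones it has seen.
        subset = rng.sample(range(len(tokens)), sample_size)
        subset.sort(key=lambda i: saliency(tokens[i]), reverse=True)
        for i in subset[:keep]:
            votes[i] += 1
    # Rank tokens by vote count, breaking ties by saliency.
    ranked = sorted(
        range(len(tokens)),
        key=lambda i: (votes[i], saliency(tokens[i])),
        reverse=True,
    )
    return sorted(ranked[:keep])


def chunked(stream, size):
    # Partition the stream into fixed-size chunks (the last may be shorter).
    return [stream[i:i + size] for i in range(0, len(stream), size)]


def process_stream(stream, chunk_size, keep):
    # A thread pool stands in for distributed cloud workers in this sketch.
    chunks = chunked(stream, chunk_size)
    with ThreadPoolExecutor() as pool:
        return list(pool.map(lambda c: swarm_select(c, keep), chunks))
```

In a transformer setting, only the selected token indices would be passed on to the attention layers, which is where the reduction in redundant computation would come from under these assumptions.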
Authors
S. Vimala (1), D.K. Mohanty (2), Karthikeyan Thangavel (3)
(1) Prathyusha Engineering College, India; (2) Government B.Ed. Training College Kalinga, India; (3) University of Technology and Applied Sciences, Sultanate of Oman
Keywords
Vision Transformers, Swarm Intelligence, Multimedia Streaming, Cloud Computing, Deep Learning