Abstract
With the exponential growth in multimedia content across platforms,
real-time video understanding—particularly object segmentation and
tracking—has become a cornerstone in applications such as
surveillance, autonomous navigation, and augmented reality.
Conventional video segmentation and tracking techniques often
struggle with real-time processing, occlusion handling, and scale
variation in dynamic environments. While deep learning models like
YOLOv8 are highly efficient in object detection, their capability in
fine-grained segmentation and continuous object identity tracking remains
underexplored. This paper introduces a novel Generative YOLOv8-based
architecture that integrates segmentation-aware heads and
temporal attention modules for accurate instance segmentation and
object tracking. A generative adversarial refinement network is
employed to enhance boundary precision and motion continuity. The
model leverages video frame sequences, producing temporal-aware
object masks while maintaining consistent object IDs across frames.
Experimental evaluations on the DAVIS and MOT20 datasets
demonstrate superior performance of the proposed model, achieving
real-time inference speeds (~35 FPS) with an mIoU of 82.3% and an IDF1
score of 84.7%, outperforming several state-of-the-art trackers and
segmenters. The framework exhibits robust performance under
occlusion, fast motion, and cluttered backgrounds, making it highly
suitable for advanced multimedia applications.
Authors
Renuka Deshpande, T.V. Saroja
Shivajirao S Jondhale College of Engineering, India
Keywords
Video Segmentation, Object Tracking, YOLOv8, Deep Learning, Multimedia Analytics