RECURRENT TRANSFORMER BASED FRAMEWORK FOR VIDEO DENOISING AND SUPER RESOLUTION USING OPTICAL FLOW AND TEMPORAL ATTENTION

ICTACT Journal on Image and Video Processing (Volume: 16, Issue: 3)

Abstract

Video restoration remains an important task in multimedia processing because visual data captured in real environments often contain noise, motion artifacts, and resolution degradation. The demand for high-quality video has grown with the spread of surveillance systems, streaming platforms, and intelligent vision applications. Traditional denoising and super-resolution approaches rely on spatial filtering and convolutional neural networks; however, these techniques struggle to model long-range temporal dependencies across frames, so restored videos frequently exhibit inconsistent textures, motion blur, and temporal flickering. This study addresses these challenges by introducing the Recurrent Optical Flow Transformer (ROFT), a recurrent transformer architecture that integrates optical flow estimation with temporal attention for joint video denoising and super-resolution. The proposed framework employs a recurrent transformer module that captures temporal correlations between adjacent frames while maintaining spatial consistency. An optical flow estimation unit guides frame alignment, reducing motion distortion and misalignment during reconstruction. In addition, a temporal attention mechanism analyzes contextual dependencies across multiple frames, enhancing feature representation in dynamic regions. The network processes sequential frames through recurrent connections that preserve temporal memory and improve reconstruction stability. Experiments were conducted on benchmark video restoration datasets containing noisy and low-resolution sequences. The evaluation demonstrates that the proposed ROFT framework achieves superior performance compared with existing approaches: the model attains a PSNR of 35.8 dB and an SSIM of 0.97, indicating improved reconstruction quality and structural preservation.
The reconstruction error decreases to an MSE of 0.005, while the temporal consistency error falls to 0.007, confirming stable frame transitions across video sequences. Furthermore, the model achieves an FSIM of 0.995, indicating strong preservation of perceptual texture features. These results demonstrate that the proposed architecture effectively integrates optical flow alignment with temporal transformer attention, enhancing both spatial detail recovery and temporal coherence in restored video frames.
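The two core operations named in the abstract, flow-guided frame alignment followed by temporal attention over the aligned frames, can be illustrated with a minimal NumPy sketch. This is an illustrative simplification, not the authors' implementation: the flow field is assumed given (ROFT estimates it with a learned unit), warping uses nearest-neighbour sampling rather than learned resampling, and the attention scores are hand-crafted similarity weights rather than learned query-key products.

```python
import numpy as np

def warp(frame, flow):
    """Backward-warp a 2-D frame with a dense flow field of shape (H, W, 2).

    flow[..., 0] is the horizontal displacement, flow[..., 1] the vertical one.
    Nearest-neighbour sampling with border clipping (a simplification of the
    bilinear warping typically used in flow-based alignment).
    """
    h, w = frame.shape
    ys, xs = np.mgrid[0:h, 0:w]
    sy = np.clip(np.round(ys + flow[..., 1]).astype(int), 0, h - 1)
    sx = np.clip(np.round(xs + flow[..., 0]).astype(int), 0, w - 1)
    return frame[sy, sx]

def temporal_attention(aligned, reference):
    """Fuse a list of aligned frames by softmax-weighted similarity
    to the reference frame (a stand-in for learned temporal attention)."""
    scores = np.array([-np.mean((f - reference) ** 2) for f in aligned])
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    fused = sum(w * f for w, f in zip(weights, aligned))
    return fused, weights
```

As a sanity check on the sketch: warping with a zero flow field is the identity, and frames identical to the reference receive equal attention weights, so the fused output equals the reference. In the full pipeline these two steps would run inside a recurrent loop that carries a hidden state across frames, which is what preserves temporal memory.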

Authors

Pitty Nagarjuna¹, B.K. Harsha²
¹Indian Institute of Science, Bengaluru, India; ²REVA University, India

Keywords

Video Denoising, Super Resolution, Recurrent Transformer, Optical Flow Estimation, Temporal Attention

Published By
ICTACT
Published In
ICTACT Journal on Image and Video Processing
(Volume: 16, Issue: 3)
Date of Publication
February 2026
Pages
3779 - 3789