STAR: Spatial-Temporal Augmentation with Text-to-Video Models for Real-World Video Super-Resolution

Summary

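Image diffusion models have been adapted to real-world video super-resolution to address the over-smoothing seen in GAN-based methods.
However, because these models are trained on static images, they struggle to maintain temporal consistency, which limits their ability to capture temporal dynamics.
Integrating text-to-video (T2V) models into video super-resolution is a straightforward way to improve temporal modeling.
Two key challenges remain, however: artifacts introduced by complex degradations in real-world scenarios, and reduced fidelity caused by the strong generative capacity of powerful T2V models (e.g., CogVideoX-5B).
To improve the spatio-temporal quality of restored videos, the authors introduce STAR (Spatial-Temporal Augmentation with T2V models for Real-world video super-resolution), a novel approach that leverages T2V models for real-world video super-resolution and achieves realistic spatial detail and robust temporal consistency.
Specifically, a Local Information Enhancement Module (LIEM) is placed before the global attention block to enrich local details and mitigate degradation artifacts.
In addition, a Dynamic Frequency (DF) Loss is proposed to reinforce fidelity by guiding the model to focus on different frequency components across diffusion steps.
Extensive experiments show that STAR outperforms state-of-the-art methods on both synthetic and real-world datasets.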
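
The idea of weighting frequency components differently across diffusion steps can be illustrated with a toy sketch. This is not the authors' DF Loss: the abstract does not give its formula, so everything below is an assumption made for illustration, including the function names (`low_pass`, `frequency_weighted_loss`), the moving-average frequency split, the linear weight schedule, and the convention that step 0 is the noisiest step. It works on 1D lists rather than video tensors.

```python
def low_pass(signal, radius=1):
    """Moving-average low-pass filter; the window is truncated at the edges."""
    n = len(signal)
    out = []
    for i in range(n):
        lo, hi = max(0, i - radius), min(n, i + radius + 1)
        window = signal[lo:hi]
        out.append(sum(window) / len(window))
    return out


def mse(a, b):
    """Mean squared error between two equal-length sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def frequency_weighted_loss(pred, target, step, num_steps):
    """Toy step-dependent loss (hypothetical, not the paper's definition).

    Assumed convention: step 0 is the noisiest step and step num_steps - 1
    is the final denoising step; num_steps must be at least 2.
    Assumed schedule: the low-frequency weight falls linearly from 1 to 0
    while the high-frequency weight rises from 0 to 1.
    """
    high_weight = step / (num_steps - 1)
    low_weight = 1.0 - high_weight

    # Split each signal into a smooth part and a residual detail part.
    pred_low, target_low = low_pass(pred), low_pass(target)
    pred_high = [p - l for p, l in zip(pred, pred_low)]
    target_high = [t - l for t, l in zip(target, target_low)]

    return (low_weight * mse(pred_low, target_low)
            + high_weight * mse(pred_high, target_high))


target = [0.0, 1.0, 0.0, 1.0, 0.0, 1.0]
shifted = [t + 1.0 for t in target]  # constant offset: a purely low-frequency error

early = frequency_weighted_loss(shifted, target, step=0, num_steps=10)
late = frequency_weighted_loss(shifted, target, step=9, num_steps=10)
```

Under these assumptions, a constant brightness offset is penalized at the first step, where only the low-frequency term is active, and is ignored at the last step, where only the detail term is active.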

Abstract (original)

Image diffusion models have been adapted for real-world video super-resolution to tackle over-smoothing issues in GAN-based methods. However, these models struggle to maintain temporal consistency, as they are trained on static images, limiting their ability to capture temporal dynamics effectively. Integrating text-to-video (T2V) models into video super-resolution for improved temporal modeling is straightforward. However, two key challenges remain: artifacts introduced by complex degradations in real-world scenarios, and compromised fidelity due to the strong generative capacity of powerful T2V models (e.g., CogVideoX-5B). To enhance the spatio-temporal quality of restored videos, we introduce STAR (Spatial-Temporal Augmentation with T2V models for Real-world video super-resolution), a novel approach that leverages T2V models for real-world video super-resolution, achieving realistic spatial details and robust temporal consistency. Specifically, we introduce a Local Information Enhancement Module (LIEM) before the global attention block to enrich local details and mitigate degradation artifacts. Moreover, we propose a Dynamic Frequency (DF) Loss to reinforce fidelity, guiding the model to focus on different frequency components across diffusion steps. Extensive experiments demonstrate STAR outperforms state-of-the-art methods on both synthetic and real-world datasets.

arXiv information

Authors	Rui Xie,Yinhong Liu,Penghao Zhou,Chen Zhao,Jun Zhou,Kai Zhang,Zhenyu Zhang,Jian Yang,Zhenheng Yang,Ying Tai
Published	2025-01-06 12:36:21+00:00
arXiv page	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
