Efficient Video Face Enhancement with Enhanced Spatial-Temporal Consistency

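Summary

Face videos are a very common type of video, appearing in movies, talk shows, live broadcasts, and other settings.
Real-world online videos often suffer from degradations such as blurring and quantization noise, because high communication costs and limited transmission bandwidth force high compression ratios.
These degradations are especially damaging for face videos, since the human visual system is highly sensitive to facial details.
Despite significant progress in video face enhancement, current methods still suffer from (i) long processing times and (ii) spatially and temporally inconsistent visual effects such as flickering.
This study proposes a novel and efficient blind video face enhancement method that addresses both problems, restoring high-quality videos from their compressed low-quality versions with an effective de-flickering mechanism.
Specifically, the method builds on a 3D-VQGAN backbone paired with spatial-temporal codebooks that record high-quality portrait features and residual-based temporal information.
The authors develop a two-stage learning framework for the model.
In Stage I, the model is trained with a regularizer that mitigates the codebook collapse problem.
In Stage II, two transformers are trained to look up codes from the codebooks, and the encoder for low-quality videos is further updated.
Experiments on the VFHQ-Test dataset show that the method surpasses current state-of-the-art blind face video restoration and de-flickering methods in both efficiency and effectiveness.
Code is available at https://github.com/Dixin-Lab/BFVR-STC.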
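
The abstract gives no implementation details, so the following is only a rough illustration of what "looking up codes from a codebook" and "codebook collapse" usually mean in vector-quantized models. It is a minimal pure-Python sketch of generic nearest-neighbor quantization, not the paper's method: the paper's Stage II trains transformers to select codes, and its codebooks are spatial-temporal. The function names, the toy codebook, and the usage metric are all assumptions made for illustration and are not taken from the paper or its repository.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def quantize(features, codebook):
    """Return, for each feature vector, the index of its nearest codebook entry."""
    indices = []
    for feature in features:
        distances = [squared_distance(feature, code) for code in codebook]
        indices.append(distances.index(min(distances)))
    return indices


def codebook_usage(indices, codebook_size):
    """Fraction of codebook entries that were selected at least once.

    A value far below 1.0 on real data is one common symptom of codebook
    collapse: most features map to only a few codes.
    """
    return len(set(indices)) / codebook_size


# Toy data (hypothetical): four 2-D codes, four 2-D feature vectors.
codebook = [[0.0, 0.0], [1.0, 1.0], [5.0, 5.0], [-3.0, 2.0]]
features = [[0.1, -0.2], [0.9, 1.2], [0.2, 0.1], [1.1, 0.8]]

indices = quantize(features, codebook)
usage = codebook_usage(indices, len(codebook))
print(indices, usage)
```

In this toy example only two of the four codes are ever selected, so half of the codebook goes unused. A regularizer against codebook collapse, such as the one the abstract mentions for Stage I, is meant to discourage that kind of under-utilization; how the paper does so is not stated in the abstract.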

Abstract (original)

As a very common type of video, face videos often appear in movies, talk shows, live broadcasts, and other scenes. Real-world online videos are often plagued by degradations such as blurring and quantization noise, due to the high compression ratio caused by high communication costs and limited transmission bandwidth. These degradations have a particularly serious impact on face videos because the human visual system is highly sensitive to facial details. Despite the significant advancement in video face enhancement, current methods still suffer from (i) long processing time and (ii) inconsistent spatial-temporal visual effects (e.g., flickering). This study proposes a novel and efficient blind video face enhancement method to overcome the above two challenges, restoring high-quality videos from their compressed low-quality versions with an effective de-flickering mechanism. In particular, the proposed method develops upon a 3D-VQGAN backbone associated with spatial-temporal codebooks recording high-quality portrait features and residual-based temporal information. We develop a two-stage learning framework for the model. In Stage I, we learn the model with a regularizer mitigating the codebook collapse problem. In Stage II, we learn two transformers to lookup code from the codebooks and further update the encoder of low-quality videos. Experiments conducted on the VFHQ-Test dataset demonstrate that our method surpasses the current state-of-the-art blind face video restoration and de-flickering methods on both efficiency and effectiveness. Code is available at https://github.com/Dixin-Lab/BFVR-STC.

arXiv information

Authors	Yutong Wang, Jiajie Teng, Jiajiong Cao, Yuming Li, Chenguang Ma, Hongteng Xu, Dixin Luo
Published	2024-11-25 15:14:36+00:00

Source and services used

arxiv.jp, Google
