Live2Diff: Live Stream Translation via Uni-directional Attention in Video Diffusion Models

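Summary

Large language models are remarkably effective at generating streaming data such as text and audio, thanks to a temporally uni-directional attention mechanism that models correlations between the current token and the tokens before it.
Video streaming, however, remains far less explored, even though demand for live video processing is growing.
State-of-the-art video diffusion models use bi-directional temporal attention to model correlations between the current frame and all surrounding frames, including future ones, which prevents them from processing streaming video.
To address this problem, we present Live2Diff, the first attempt to design a video diffusion model with uni-directional temporal attention, aimed specifically at live streaming video translation.
Compared with previous work, our approach ensures temporal consistency and smoothness by correlating the current frame with its predecessors and a few initial warmup frames, without using any future frames.
In addition, a highly efficient denoising scheme featuring a KV-cache mechanism and pipelining enables streaming video translation at interactive frame rates.
Extensive experiments demonstrate the effectiveness of the proposed attention mechanism and pipeline, which outperform previous methods in temporal smoothness and/or efficiency.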
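
The attention pattern described in the abstract (each frame sees its predecessors and a few initial warmup frames, never future frames) can be illustrated with a small mask-building sketch. This is not the authors' implementation: the function name `streaming_attention_mask`, the parameters `num_warmup` and `window`, and the idea of limiting predecessors to a fixed-size recent window are all assumptions made for illustration.

```python
def streaming_attention_mask(num_frames, num_warmup, window):
    """Build a boolean mask where mask[i][j] is True if frame i may attend to frame j.

    Hypothetical sketch: `num_warmup` initial frames stay visible to every later
    frame, and `window` (an assumed parameter) bounds how many recent frames,
    including the current one, are visible.
    """
    mask = []
    for i in range(num_frames):
        row = []
        for j in range(num_frames):
            is_future = j > i            # future frames are never visible
            is_warmup = j < num_warmup   # initial warmup frames
            in_window = i - j < window   # recent predecessors and the frame itself
            row.append((not is_future) and (is_warmup or in_window))
        mask.append(row)
    return mask


if __name__ == "__main__":
    # Print the mask for 6 frames, 2 warmup frames, and a window of 2.
    for row in streaming_attention_mask(6, 2, 2):
        print("".join("x" if allowed else "." for allowed in row))
```

Every entry above the diagonal is False, which is what makes the attention uni-directional. Because the keys and values of past frames never depend on later frames under such a mask, they could in principle be cached and reused, which is the property a KV-cache relies on.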

Abstract (original)

Large Language Models have shown remarkable efficacy in generating streaming data such as text and audio, thanks to their temporally uni-directional attention mechanism, which models correlations between the current token and previous tokens. However, video streaming remains much less explored, despite a growing need for live video processing. State-of-the-art video diffusion models leverage bi-directional temporal attention to model the correlations between the current frame and all the surrounding (i.e. including future) frames, which hinders them from processing streaming videos. To address this problem, we present Live2Diff, the first attempt at designing a video diffusion model with uni-directional temporal attention, specifically targeting live streaming video translation. Compared to previous works, our approach ensures temporal consistency and smoothness by correlating the current frame with its predecessors and a few initial warmup frames, without any future frames. Additionally, we use a highly efficient denoising scheme featuring a KV-cache mechanism and pipelining, to facilitate streaming video translation at interactive framerates. Extensive experiments demonstrate the effectiveness of the proposed attention mechanism and pipeline, outperforming previous methods in terms of temporal smoothness and/or efficiency.

arXiv information

Authors	Zhening Xing, Gereon Fox, Yanhong Zeng, Xingang Pan, Mohamed Elgharib, Christian Theobalt, Kai Chen
Published	2024-07-11 17:34:51+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
