AV-Link: Temporally-Aligned Diffusion Features for Cross-Modal Audio-Video Generation

Abstract

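We propose AV-Link, a unified framework for video-to-audio and audio-to-video generation that leverages the activations of frozen video and audio diffusion models for temporally aligned cross-modal conditioning.
The key to our framework is a Fusion Block that enables bidirectional information exchange between the backbone video and audio diffusion models through a temporally aligned self-attention operation.
Unlike prior work, which derives the conditioning signal from feature extractors pretrained for other tasks, AV-Link can directly leverage the features obtained by the complementary modality within a single framework: video features to generate audio, or audio features to generate video.
We extensively evaluate our design choices and demonstrate the ability of our method to produce synchronized, high-quality audiovisual content, showing its potential for applications in immersive media generation.
Project page: snap-research.github.io/AVLink/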

Abstract (original)

We propose AV-Link, a unified framework for Video-to-Audio and Audio-to-Video generation that leverages the activations of frozen video and audio diffusion models for temporally-aligned cross-modal conditioning. The key to our framework is a Fusion Block that enables bidirectional information exchange between our backbone video and audio diffusion models through a temporally-aligned self attention operation. Unlike prior work that uses feature extractors pretrained for other tasks for the conditioning signal, AV-Link can directly leverage features obtained by the complementary modality in a single framework i.e. video features to generate audio, or audio features to generate video. We extensively evaluate our design choices and demonstrate the ability of our method to achieve synchronized and high-quality audiovisual content, showcasing its potential for applications in immersive media generation. Project Page: snap-research.github.io/AVLink/
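
The abstract does not spell out how the Fusion Block's temporally aligned self-attention works, so the following is only a minimal sketch of one way such an operation could look, not the authors' implementation. It assumes that video and audio tokens cover the same clip duration at different rates, that each token is tagged with a timestamp in seconds, and that a rotary-style rotation keyed on that timestamp is applied to queries and keys before joint attention over both modalities. The function names (`token_times`, `rotate`, `fused_self_attention`), the frequency base, and the absence of learned projections are all illustrative assumptions.

```python
import math


def token_times(num_tokens, duration_s):
    """Centre timestamp (seconds) of each token on a uniform grid (assumed layout)."""
    step = duration_s / num_tokens
    return [(i + 0.5) * step for i in range(num_tokens)]


def rotate(vec, t, base=10000.0):
    """Rotary-style encoding: rotate consecutive pairs of `vec` by angles proportional to time t.

    `vec` must have even length. The frequency schedule is an assumption.
    """
    half = len(vec) // 2
    out = []
    for k in range(half):
        angle = t * base ** (-k / half)
        c, s = math.cos(angle), math.sin(angle)
        x, y = vec[2 * k], vec[2 * k + 1]
        out += [x * c - y * s, x * s + y * c]
    return out


def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    z = sum(es)
    return [e / z for e in es]


def fused_self_attention(video, audio, duration_s):
    """Joint self-attention over concatenated video and audio tokens.

    Queries and keys are rotated by timestamp, so attention scores depend on
    the time difference between tokens regardless of modality. Learned
    Q/K/V projections are omitted (identity) to keep the sketch short.
    """
    tokens = video + audio
    times = token_times(len(video), duration_s) + token_times(len(audio), duration_s)
    qk = [rotate(x, t) for x, t in zip(tokens, times)]
    dim = len(tokens[0])
    out = []
    for q in qk:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in qk]
        weights = softmax(scores)
        out.append([sum(w * v[d] for w, v in zip(weights, tokens)) for d in range(dim)])
    # Split the fused sequence back into per-modality outputs.
    return out[:len(video)], out[len(video):]
```

In this sketch, a video token and an audio token that fall at the same instant receive the same rotation, which is one simple way to make cross-modal attention aware of temporal alignment even when the two modalities have different token rates.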

arXiv information

Authors	Moayed Haji-Ali, Willi Menapace, Aliaksandr Siarohin, Ivan Skorokhodov, Alper Canberk, Kwot Sin Lee, Vicente Ordonez, Sergey Tulyakov
Published	2024-12-19 18:57:21+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
