Frieren: Efficient Video-to-Audio Generation with Rectified Flow Matching

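Summary

Video-to-audio (V2A) generation aims to synthesize audio that matches the content of a silent video, but building V2A models that combine high generation quality, efficiency, and temporal synchrony between vision and audio remains challenging.
We propose Frieren, a V2A model based on rectified flow matching.
Frieren regresses the conditional transport vector field from noise to spectrogram latents along straight paths and samples by solving an ODE, outperforming autoregressive and score-based models in audio quality.
By employing a non-autoregressive vector field estimator built on a feed-forward transformer, together with channel-level cross-modal feature fusion with strong temporal alignment, the model generates audio that is highly synchronized with the input video.
Furthermore, through reflow and one-step distillation with a guided vector field, the model can generate decent audio in a few sampling steps, or even just one.
Experiments indicate that Frieren achieves state-of-the-art performance on VGGSound in both generation quality and temporal alignment, with alignment accuracy reaching 97.22% and a 6.2% improvement in inception score over a strong diffusion-based baseline.
Audio samples are available at http://frieren-v2a.github.io.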

Summary (original)

Video-to-audio (V2A) generation aims to synthesize content-matching audio from silent video, and it remains challenging to build V2A models with high generation quality, efficiency, and visual-audio temporal synchrony. We propose Frieren, a V2A model based on rectified flow matching. Frieren regresses the conditional transport vector field from noise to spectrogram latent with straight paths and conducts sampling by solving ODE, outperforming autoregressive and score-based models in terms of audio quality. By employing a non-autoregressive vector field estimator based on a feed-forward transformer and channel-level cross-modal feature fusion with strong temporal alignment, our model generates audio that is highly synchronized with the input video. Furthermore, through reflow and one-step distillation with guided vector field, our model can generate decent audio in a few, or even only one sampling step. Experiments indicate that Frieren achieves state-of-the-art performance in both generation quality and temporal alignment on VGGSound, with alignment accuracy reaching 97.22%, and 6.2% improvement in inception score over the strong diffusion-based baseline. Audio samples are available at http://frieren-v2a.github.io .
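
The abstract's core mechanism (regressing a vector field along straight noise-to-data paths, then sampling by solving an ODE) can be illustrated with a toy sketch. This is not the paper's implementation: plain Python lists stand in for spectrogram latents, the time convention (noise at t=0, data at t=1) is an assumption, and `oracle` is a hypothetical stand-in for a trained, video-conditioned estimator. All function names are illustrative.

```python
import random


def interpolate(x0, x1, t):
    """Point on the straight path from noise x0 (t=0) to data x1 (t=1)."""
    return [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]


def target_velocity(x0, x1):
    """Regression target on a straight path: the constant velocity x1 - x0."""
    return [b - a for a, b in zip(x0, x1)]


def flow_matching_loss(predict, x0, x1, t):
    """Mean squared error between predicted and target velocity at time t."""
    xt = interpolate(x0, x1, t)
    v_pred = predict(xt, t)
    v_true = target_velocity(x0, x1)
    return sum((p - q) ** 2 for p, q in zip(v_pred, v_true)) / len(v_true)


def euler_sample(predict, x0, num_steps):
    """Integrate dx/dt = predict(x, t) from t=0 to t=1 with Euler steps."""
    x = list(x0)
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = i * dt  # the last step starts at t = 1 - dt, so t never reaches 1
        v = predict(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


# Toy setup (assumption): the "dataset" is a single target vector, so the
# ideal vector field has the closed form (target - x) / (1 - t).
target = [1.0, -2.0, 0.5]


def oracle(x, t):
    """Hypothetical perfect estimator for the single-target toy problem."""
    return [(b - a) / (1.0 - t) for a, b in zip(x, target)]


random.seed(0)
noise = [random.gauss(0.0, 1.0) for _ in target]
loss = flow_matching_loss(oracle, noise, target, 0.3)
one_step = euler_sample(oracle, noise, 1)
four_step = euler_sample(oracle, noise, 4)
```

In this toy case the paths are exactly straight, so a single Euler step already lands on the target. With a learned estimator the paths are only approximately straight, and the abstract attributes few-step and one-step generation to reflow and distillation.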

arXiv information

Authors	Yongqi Wang, Wenxiang Guo, Rongjie Huang, Jiawei Huang, Zehan Wang, Fuming You, Ruiqi Li, Zhou Zhao
Published	2024-07-09 15:55:57+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
