Video-Guided Foley Sound Generation with Multimodal Controls

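Summary

Generating sound effects for video often calls for artistic sounds that depart significantly from their real-world sources, along with flexible control over the sound design.
To address this, we introduce MultiFoley, a model for video-guided sound generation that supports multimodal conditioning on text, audio, and video.
Given a silent video and a text prompt, MultiFoley lets users create clean sounds (e.g., skateboard wheels spinning with no wind noise) or more whimsical ones (e.g., making a lion's roar sound like a cat's meow).
MultiFoley also lets users choose reference audio from sound effects (SFX) libraries, or partial videos, as conditioning.
The key novelty of our model is joint training on both internet video datasets with low-quality audio and professional SFX recordings, which enables high-quality, full-bandwidth (48 kHz) audio generation.
Through automated evaluations and human studies, we show that MultiFoley generates synchronized, high-quality sound across a variety of conditioning inputs and outperforms existing methods.
For video results, see the project page: https://ificl.github.io/MultiFoley/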

Abstract (original)

Generating sound effects for videos often requires creating artistic sound effects that diverge significantly from real-life sources and flexible control in the sound design. To address this problem, we introduce MultiFoley, a model designed for video-guided sound generation that supports multimodal conditioning through text, audio, and video. Given a silent video and a text prompt, MultiFoley allows users to create clean sounds (e.g., skateboard wheels spinning without wind noise) or more whimsical sounds (e.g., making a lion’s roar sound like a cat’s meow). MultiFoley also allows users to choose reference audio from sound effects (SFX) libraries or partial videos for conditioning. A key novelty of our model lies in its joint training on both internet video datasets with low-quality audio and professional SFX recordings, enabling high-quality, full-bandwidth (48kHz) audio generation. Through automated evaluations and human studies, we demonstrate that MultiFoley successfully generates synchronized high-quality sounds across varied conditional inputs and outperforms existing methods. Please see our project page for video results: https://ificl.github.io/MultiFoley/

arXiv information

Authors	Ziyang Chen, Prem Seetharaman, Bryan Russell, Oriol Nieto, David Bourgin, Andrew Owens, Justin Salamon
Published	2024-11-26 18:59:58+00:00
arXiv page	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
