Tell What You Hear From What You See — Video to Audio Generation Through Text

Summary

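The content of visual and audio scenes is multi-faceted: a single video can be paired with many different audio tracks, and vice versa.
A video-to-audio generation task therefore needs a steering mechanism for controlling the generated audio.
Video-to-audio generation is a well-established generative task, but existing methods lack this kind of controllability.
This work proposes VATT, a multi-modal generative framework that takes a video and an optional text prompt as input and generates audio and, optionally, a textual description of that audio.
Such a framework has two advantages: i) the video-to-audio generation process can be refined and controlled through text that complements the context given by the visual information, and ii) the model can suggest what audio to generate for a video by generating audio captions.
VATT consists of two key modules. VATT Converter is an instruction-fine-tuned LLM that includes a projection layer mapping video features into the LLM vector space.
VATT Audio is a transformer that generates audio tokens from visual frames and an optional text prompt using iterative parallel decoding.
The audio tokens are converted to a waveform by a pretrained neural codec.
Experiments show that, on objective metrics, VATT achieves performance competitive with existing video-to-audio generation methods even when no audio caption is provided.
When an audio caption is provided as a prompt, VATT achieves more refined performance (lowest KLD score of 1.41).
In addition, subjective studies show that audio generated by VATT Audio is preferred over audio generated by existing methods.
VATT enables text-controllable video-to-audio generation and also suggests text prompts for videos through audio captions, unlocking new applications such as text-guided video-to-audio generation and video-to-audio captioning.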
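
The summary names iterative parallel decoding but does not describe its schedule, so the following is only a minimal sketch of one common form of the idea (confidence-based unmasking with a cosine schedule, in the style of masked-token decoders). It is not the authors' implementation. The names `toy_predictor`, `iterative_parallel_decode`, and `MASK` are hypothetical, and the predictor is a random stand-in that ignores the video and text conditioning a real model would use.

```python
import math
import random

MASK = -1  # hypothetical sentinel for a position that has not been decoded yet


def toy_predictor(tokens, vocab_size, rng):
    """Random stand-in for a trained transformer (assumption, not VATT).

    Returns one (token, confidence) pair per position. A real model would
    condition on video features, the text prompt, and the current tokens.
    """
    return [(rng.randrange(vocab_size), rng.random()) for _ in tokens]


def iterative_parallel_decode(length, vocab_size, steps,
                              predictor=toy_predictor, seed=0):
    """Fill a fully masked sequence over a fixed number of parallel steps."""
    rng = random.Random(seed)
    tokens = [MASK] * length
    for step in range(1, steps + 1):
        masked = [i for i, t in enumerate(tokens) if t == MASK]
        if not masked:
            break
        # Predict every position in parallel.
        preds = predictor(tokens, vocab_size, rng)
        # Cosine schedule: how many positions may stay masked after this step.
        keep_masked = int(length * math.cos(math.pi / 2 * step / steps))
        n_commit = max(1, len(masked) - keep_masked)
        # Commit the most confident predictions among the masked positions.
        masked.sort(key=lambda i: preds[i][1], reverse=True)
        for i in masked[:n_commit]:
            tokens[i] = preds[i][0]
    return tokens


audio_tokens = iterative_parallel_decode(length=16, vocab_size=1024, steps=4)
print(audio_tokens)
```

In this sketch the number of committed tokens grows from step to step, and the final step commits everything that is still masked, so the sequence is complete after at most `steps` forward passes rather than one pass per token.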

Summary (original)

The content of visual and audio scenes is multi-faceted such that a video can be paired with various audio and vice-versa. Thereby, in video-to-audio generation task, it is imperative to introduce steering approaches for controlling the generated audio. While Video-to-Audio generation is a well-established generative task, existing methods lack such controllability. In this work, we propose VATT, a multi-modal generative framework that takes a video and an optional text prompt as input, and generates audio and optional textual description of the audio. Such a framework has two advantages: i) Video-to-Audio generation process can be refined and controlled via text which complements the context of visual information, and ii) The model can suggest what audio to generate for the video by generating audio captions. VATT consists of two key modules: VATT Converter, a LLM that is fine-tuned for instructions and includes a projection layer that maps video features to the LLM vector space; and VATT Audio, a transformer that generates audio tokens from visual frames and from optional text prompt using iterative parallel decoding. The audio tokens are converted to a waveform by pretrained neural codec. Experiments show that when VATT is compared to existing video-to-audio generation methods in objective metrics, it achieves competitive performance when the audio caption is not provided. When the audio caption is provided as a prompt, VATT achieves even more refined performance (lowest KLD score of 1.41). Furthermore, subjective studies show that VATT Audio has been chosen as preferred generated audio than audio generated by existing methods. VATT enables controllable video-to-audio generation through text as well as suggesting text prompts for videos through audio captions, unlocking novel applications such as text-guided video-to-audio generation and video-to-audio captioning.

arXiv information

Authors	Xiulong Liu, Kun Su, Eli Shlizerman
Published	2024-11-08 16:29:07+00:00
arXiv site	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google
