Seeing Sound: Assembling Sounds from Visuals for Audio-to-Image Generation

Summary

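Training audio-to-image generative models requires a large number of diverse, semantically aligned audio-visual pairs.
Such data is almost always curated from in-the-wild videos, because cross-modal semantic correspondence is inherent to video.
In this work, we hypothesize that insisting on ground-truth audio-visual correspondence is not only unnecessary but also severely restricts the scale, quality, and diversity of the data, ultimately impairing its use in modern generative models.
That is, we propose a scalable image sonification framework in which instances from a variety of high-quality yet disjoint uni-modal sources are artificially paired through a retrieval process empowered by the reasoning capabilities of modern vision-language models.
To demonstrate the efficacy of this approach, we use the sonified images to train an audio-to-image generative model that performs competitively against the state of the art.
Finally, through a series of ablation studies, we show several intriguing auditory capabilities, such as semantic mixing and interpolation, loudness calibration, and acoustic space modeling through reverberation, that the model has implicitly developed to guide the image generation process.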

Summary (original)

Training audio-to-image generative models requires an abundance of diverse audio-visual pairs that are semantically aligned. Such data is almost always curated from in-the-wild videos, given the cross-modal semantic correspondence that is inherent to them. In this work, we hypothesize that insisting on the absolute need for ground truth audio-visual correspondence, is not only unnecessary, but also leads to severe restrictions in scale, quality, and diversity of the data, ultimately impairing its use in the modern generative models. That is, we propose a scalable image sonification framework where instances from a variety of high-quality yet disjoint uni-modal origins can be artificially paired through a retrieval process that is empowered by reasoning capabilities of modern vision-language models. To demonstrate the efficacy of this approach, we use our sonified images to train an audio-to-image generative model that performs competitively against state-of-the-art. Finally, through a series of ablation studies, we exhibit several intriguing auditory capabilities like semantic mixing and interpolation, loudness calibration and acoustic space modeling through reverberation that our model has implicitly developed to guide the image generation process.
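The abstract does not spell out how the retrieval step works, so the following is only a minimal sketch of one way images and audio clips from disjoint sources could be paired. It assumes that each image (for example, via a vision-language model's description of it) and each audio clip has already been embedded into a shared vector space; the function names, file names, and toy vectors are all hypothetical and are not taken from the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def pair_images_with_audio(image_embeddings, audio_embeddings, top_k=1):
    """For each image id, return the top_k audio ids ranked by cosine similarity.

    Both arguments map an id to an embedding in an assumed shared space.
    """
    pairs = {}
    for image_id, image_vec in image_embeddings.items():
        ranked = sorted(
            audio_embeddings,
            key=lambda audio_id: cosine(image_vec, audio_embeddings[audio_id]),
            reverse=True,
        )
        pairs[image_id] = ranked[:top_k]
    return pairs


# Toy, hand-made embeddings (hypothetical; a real system would compute these).
image_embeddings = {
    "dog_photo": [0.9, 0.1, 0.0],
    "beach_photo": [0.0, 0.2, 0.9],
}
audio_embeddings = {
    "bark.wav": [1.0, 0.0, 0.1],
    "waves.wav": [0.1, 0.1, 1.0],
    "engine.wav": [0.0, 1.0, 0.0],
}

pairs = pair_images_with_audio(image_embeddings, audio_embeddings)
print(pairs)
```

With these toy vectors, each image is matched to the clip whose embedding points in the most similar direction. A larger `top_k` would yield several candidate sounds per image, which is one way such a scheme could add diversity.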

arXiv information

Authors	Darius Petermann, Mahdi M. Kalayeh
Published	2025-01-09 18:13:57+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
