ZeroSep: Separate Anything in Audio with Zero Training

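Summary

Audio source separation is fundamental for machines to understand complex acoustic environments, and it underpins numerous audio applications.
Current supervised deep learning approaches are powerful, but they need extensive task-specific labeled data and struggle to generalize to the immense variability and open-set nature of real-world acoustic scenes.
Inspired by the success of generative foundation models, we investigate whether pre-trained text-guided audio diffusion models can overcome these limitations.
We make a surprising discovery: under the right configuration, zero-shot source separation can be achieved purely through a pre-trained text-guided audio diffusion model.
Our method, named ZeroSep, inverts the mixed audio into the diffusion model's latent space and then uses text conditioning to guide the denoising process toward recovering individual sources.
Without any task-specific training or fine-tuning, ZeroSep repurposes the generative diffusion model for a discriminative separation task and inherently supports open-set scenarios through its rich textual priors.
ZeroSep is compatible with a variety of pre-trained text-guided audio diffusion backbones and delivers strong separation performance on multiple separation benchmarks, surpassing even supervised methods.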

Summary (original)

Audio source separation is fundamental for machines to understand complex acoustic environments and underpins numerous audio applications. Current supervised deep learning approaches, while powerful, are limited by the need for extensive, task-specific labeled data and struggle to generalize to the immense variability and open-set nature of real-world acoustic scenes. Inspired by the success of generative foundation models, we investigate whether pre-trained text-guided audio diffusion models can overcome these limitations. We make a surprising discovery: zero-shot source separation can be achieved purely through a pre-trained text-guided audio diffusion model under the right configuration. Our method, named ZeroSep, works by inverting the mixed audio into the diffusion model’s latent space and then using text conditioning to guide the denoising process to recover individual sources. Without any task-specific training or fine-tuning, ZeroSep repurposes the generative diffusion model for a discriminative separation task and inherently supports open-set scenarios through its rich textual priors. ZeroSep is compatible with a variety of pre-trained text-guided audio diffusion backbones and delivers strong separation performance on multiple separation benchmarks, surpassing even supervised methods.
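The invert-then-denoise idea described in the abstract can be illustrated with a small toy. The sketch below is not the authors' implementation: it assumes deterministic DDIM-style updates, assumes inversion is run with an empty prompt, and replaces the pre-trained text-guided diffusion model with a hypothetical band-pass "noise predictor" (`toy_eps_model`) whose prompts (`PROMPT_BANDS`) map to frequency bands. All names and the noise schedule are illustrative assumptions.

```python
import numpy as np

# Hypothetical prompt -> rFFT bin range, standing in for a learned text prior.
PROMPT_BANDS = {"": (0, None), "low tone": (0, 20), "high tone": (20, None)}


def toy_eps_model(x_t, alpha_bar_t, prompt):
    """Toy stand-in for a pre-trained text-conditioned noise predictor."""
    lo, hi = PROMPT_BANDS[prompt]
    spectrum = np.fft.rfft(x_t)
    mask = np.zeros(spectrum.shape)
    mask[lo:hi] = 1.0
    # The toy model's clean-signal estimate keeps only the prompt's band.
    x0_hat = np.fft.irfft(spectrum * mask, n=x_t.size) / np.sqrt(alpha_bar_t)
    # Whatever the estimate does not explain is treated as noise.
    return (x_t - np.sqrt(alpha_bar_t) * x0_hat) / np.sqrt(1.0 - alpha_bar_t)


def ddim_move(x, a_from, a_to, prompt):
    """One deterministic DDIM update from noise level a_from to a_to."""
    eps = toy_eps_model(x, a_from, prompt)
    x0_hat = (x - np.sqrt(1.0 - a_from) * eps) / np.sqrt(a_from)
    return np.sqrt(a_to) * x0_hat + np.sqrt(1.0 - a_to) * eps


def separate(mixture, prompt, alpha_bars):
    """Invert the mixture toward noise, then denoise guided by `prompt`."""
    x = mixture
    # Inversion (assumption: empty prompt), from nearly clean to noisy.
    for a_from, a_to in zip(alpha_bars[:-1], alpha_bars[1:]):
        x = ddim_move(x, a_from, a_to, "")
    # Text-guided denoising, from noisy back to nearly clean.
    reversed_bars = alpha_bars[::-1]
    for a_from, a_to in zip(reversed_bars[:-1], reversed_bars[1:]):
        x = ddim_move(x, a_from, a_to, prompt)
    return x


# Two synthetic "sources": a 5-cycle and a 60-cycle sinusoid.
t = np.arange(1024) / 1024.0
low = np.sin(2 * np.pi * 5 * t)
high = 0.5 * np.sin(2 * np.pi * 60 * t)

# Illustrative schedule of cumulative alphas (nearly clean -> noisy).
alpha_bars = np.linspace(0.9999, 0.01, 50)

est_low = separate(low + high, "low tone", alpha_bars)
est_high = separate(low + high, "high tone", alpha_bars)
```

In this toy, the component matching the prompt survives the round trip while the rest is attenuated during denoising. A real system would replace `toy_eps_model` with an actual pre-trained text-guided audio diffusion backbone operating in its latent space.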

arXiv information

Authors	Chao Huang, Yuesheng Ma, Junxuan Huang, Susan Liang, Yunlong Tang, Jing Bi, Wenqiang Liu, Nima Mesgarani, Chenliang Xu
Published	2025-05-29 16:31:45+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
