Stable Audio Open

要約

オープン生成モデルはコミュニティにとって非常に重要であり、微調整が可能であり、新しいモデルを提示する際のベースラインとして機能します。
ただし、現在のテキスト音声変換モデルのほとんどは非公開であり、アーティストや研究者が構築するためにアクセスすることはできません。
ここでは、クリエイティブコモンズデータを使用してトレーニングされた新しいオープンウェイトテキスト音声変換モデルのアーキテクチャとトレーニングプロセスについて説明します。
私たちの評価では、このモデルのパフォーマンスがさまざまな指標において最先端のパフォーマンスに匹敵することが示されています。
特に、報告された FDopenl3 の結果 (世代のリアリズムを測定) は、44.1kHz での高品質のステレオサウンド合成の可能性を示しています。

要約(オリジナル)

Open generative models are vitally important for the community, allowing for fine-tunes and serving as baselines when presenting new models. However, most current text-to-audio models are private and not accessible for artists and researchers to build upon. Here we describe the architecture and training process of a new open-weights text-to-audio model trained with Creative Commons data. Our evaluation shows that the model’s performance is competitive with the state-of-the-art across various metrics. Notably, the reported FDopenl3 results (measuring the realism of the generations) showcase its potential for high-quality stereo sound synthesis at 44.1kHz.

arxiv情報

著者	Zach Evans,Julian D. Parker,CJ Carr,Zack Zukowski,Josiah Taylor,Jordi Pons
発行日	2024-07-31 16:22:42+00:00
arxivサイト	arxiv_id(pdf)

提供元, 利用サービス

arxiv.jp, Google

Stable Audio Open

要約

要約(オリジナル)

arxiv情報

提供元, 利用サービス

最近の投稿

最近のコメント

アーカイブ

カテゴリー