YuE: Scaling Open Foundation Models for Long-Form Music Generation

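Summary

We address long-form music generation, and in particular the challenging lyrics-to-song problem, by introducing YuE, a family of open foundation models based on the LLaMA2 architecture.
Specifically, YuE scales to trillions of tokens and generates up to five minutes of music while keeping the lyrics aligned, the musical structure coherent, and the vocal melodies engaging with appropriate accompaniment.
It achieves this through (1) track-decoupled next-token prediction to overcome dense mixture signals, (2) structural progressive conditioning for long-context lyrical alignment, and (3) a multitask, multiphase pre-training recipe to converge and generalize.
We also redesign the in-context learning technique for music generation, which enables versatile style transfer (for example, converting Japanese city pop into an English rap while preserving the original accompaniment) and bidirectional generation.
Extensive evaluation shows that YuE matches or even surpasses some proprietary systems in musicality and vocal agility.
Fine-tuning YuE additionally enables further controls and improved support for tail languages.
Beyond generation, YuE's learned representations also perform well on music understanding tasks, where its results match or exceed state-of-the-art methods on the MARBLE benchmark.
Keywords: lyrics2song, song generation, long-form, foundation model, music generation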

Abstract (original)

We tackle the task of long-form music generation, particularly the challenging lyrics-to-song problem, by introducing YuE, a family of open foundation models based on the LLaMA2 architecture. Specifically, YuE scales to trillions of tokens and generates up to five minutes of music while maintaining lyrical alignment, coherent musical structure, and engaging vocal melodies with appropriate accompaniment. It achieves this through (1) track-decoupled next-token prediction to overcome dense mixture signals, (2) structural progressive conditioning for long-context lyrical alignment, and (3) a multitask, multiphase pre-training recipe to converge and generalize. In addition, we redesign the in-context learning technique for music generation, enabling versatile style transfer (e.g., converting Japanese city pop into an English rap while preserving the original accompaniment) and bidirectional generation. Through extensive evaluation, we demonstrate that YuE matches or even surpasses some of the proprietary systems in musicality and vocal agility. In addition, fine-tuning YuE enables additional controls and enhanced support for tail languages. Furthermore, beyond generation, we show that YuE’s learned representations can perform well on music understanding tasks, where the results of YuE match or exceed state-of-the-art methods on the MARBLE benchmark.
Keywords: lyrics2song, song generation, long-form, foundation model, music generation

arXiv information

Authors	Ruibin Yuan, Hanfeng Lin, Shuyue Guo, Ge Zhang, Jiahao Pan, Yongyi Zang, Haohe Liu, Yiming Liang, Wenye Ma, Xingjian Du, Xinrun Du, Zhen Ye, Tianyu Zheng, Yinghao Ma, Minghao Liu, Zeyue Tian, Ziya Zhou, Liumeng Xue, Xingwei Qu, Yizhi Li, Shangda Wu, Tianhao Shen, Ziyang Ma, Jun Zhan, Chunhui Wang, Yatian Wang, Xiaowei Chi, Xinyue Zhang, Zhenzhu Yang, Xiangzhou Wang, Shansong Liu, Lingrui Mei, Peng Li, Junjie Wang, Jianwei Yu, Guojian Pang, Xu Li, Zihao Wang, Xiaohuan Zhou, Lijun Yu, Emmanouil Benetos, Yong Chen, Chenghua Lin, Xie Chen, Gus Xia, Zhaoxiang Zhang, Chao Zhang, Wenhu Chen, Xinyu Zhou, Xipeng Qiu, Roger Dannenberg, Jiaheng Liu, Jian Yang, Wenhao Huang, Wei Xue, Xu Tan, Yike Guo
Published	2025-03-11 17:26:50+00:00
arXiv page	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google
