MuQ: Self-Supervised Music Representation Learning with Mel Residual Vector Quantization

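Summary

In recent years, foundation models pre-trained with self-supervised learning (SSL) have succeeded in a range of music informatics understanding tasks, including music tagging, instrument classification, and key detection. This paper proposes a self-supervised music representation learning model for music understanding. Unlike previous work that adopts random projection or an existing neural codec, the proposed model, named MuQ, is trained to predict tokens generated by Mel Residual Vector Quantization (Mel-RVQ). Mel-RVQ uses a residual linear projection structure to quantize the Mel spectrum, which makes target extraction more stable and efficient and leads to better performance. Experiments on a wide variety of downstream tasks show that MuQ outperforms previous self-supervised music representation models with only 0.9K hours of open-source pre-training data. Scaling the data up to over 160K hours and adopting iterative training consistently improves model performance. To further validate the model, the authors present MuQ-MuLan, a joint music-text embedding model based on contrastive learning, which achieves state-of-the-art performance on the zero-shot music tagging task on the MagnaTagATune dataset. Code and checkpoints are open-sourced at https://github.com/tencent-ailab/MuQ.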

Abstract (original)

Recent years have witnessed the success of foundation models pre-trained with self-supervised learning (SSL) in various music informatics understanding tasks, including music tagging, instrument classification, key detection, and more. In this paper, we propose a self-supervised music representation learning model for music understanding. Distinguished from previous studies adopting random projection or existing neural codec, the proposed model, named MuQ, is trained to predict tokens generated by Mel Residual Vector Quantization (Mel-RVQ). Our Mel-RVQ utilizes residual linear projection structure for Mel spectrum quantization to enhance the stability and efficiency of target extraction and lead to better performance. Experiments in a large variety of downstream tasks demonstrate that MuQ outperforms previous self-supervised music representation models with only 0.9K hours of open-source pre-training data. Scaling up the data to over 160K hours and adopting iterative training consistently improve the model performance. To further validate the strength of our model, we present MuQ-MuLan, a joint music-text embedding model based on contrastive learning, which achieves state-of-the-art performance in the zero-shot music tagging task on the MagnaTagATune dataset. Code and checkpoints are open source in https://github.com/tencent-ailab/MuQ.
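The abstract states only that Mel-RVQ applies a residual linear projection structure to the Mel spectrum. As a rough illustration of what residual quantization with linear maps can look like, here is a minimal NumPy sketch. It is not the authors' implementation: the function name `residual_vq`, the separate projection and decoder matrices, all sizes, and the random (untrained) weights are assumptions made for this example.

```python
import numpy as np


def residual_vq(frames, projections, codebooks, decoders):
    """Quantize mel frames in several residual steps (illustrative only).

    frames:      (T, D) array of mel-spectrogram frames.
    projections: list of (D, d) matrices mapping a residual into code space.
    codebooks:   list of (K, d) arrays of code vectors.
    decoders:    list of (d, D) matrices mapping a code vector back to mel space.
    Returns (tokens, reconstruction); tokens has shape (T, n_steps).
    """
    residual = frames.astype(np.float64)
    recon = np.zeros_like(residual)
    tokens = []
    for proj, codebook, dec in zip(projections, codebooks, decoders):
        z = residual @ proj  # linear projection of the current residual, (T, d)
        # Squared Euclidean distance from every frame to every code, (T, K).
        dists = ((z[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
        idx = dists.argmin(axis=1)      # nearest code index per frame
        approx = codebook[idx] @ dec    # chosen code mapped back to mel space
        recon += approx
        residual = residual - approx    # the next step quantizes what is left
        tokens.append(idx)
    return np.stack(tokens, axis=1), recon


# Hypothetical sizes and random weights, chosen only to make the sketch run.
rng = np.random.default_rng(0)
T, D, d, K, n_steps = 4, 128, 16, 32, 3
frames = rng.standard_normal((T, D))
projections = [rng.standard_normal((D, d)) / np.sqrt(D) for _ in range(n_steps)]
codebooks = [rng.standard_normal((K, d)) for _ in range(n_steps)]
decoders = [rng.standard_normal((d, D)) / np.sqrt(d) for _ in range(n_steps)]

tokens, recon = residual_vq(frames, projections, codebooks, decoders)
print(tokens.shape)  # (4, 3): one token per frame per residual step
```

In a setup like the one the abstract describes, token indices of this kind would serve as prediction targets for the self-supervised model. How the projections and codebooks are actually trained is not covered by the abstract and is not modeled here.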

arXiv information

Authors	Haina Zhu, Yizhi Zhou, Hangting Chen, Jianwei Yu, Ziyang Ma, Rongzhi Gu, Yi Luo, Wei Tan, Xie Chen
Published	2025-01-03 08:35:34+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, DeepL
