CrossMuSim: A Cross-Modal Framework for Music Similarity Retrieval with LLM-Powered Text Description Sourcing and Mining

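Summary

Music similarity retrieval is fundamental to managing and exploring relevant content in the large collections of streaming platforms.
This paper introduces a new cross-modal contrastive learning framework that uses the open-ended nature of text descriptions to guide music similarity modeling, addressing the limitations of traditional uni-modal approaches in capturing complex musical relationships.
To overcome the scarcity of high-quality text-music paired data, the paper introduces a dual-source data collection approach that combines online scraping with LLM-based prompting.
Extensive experiments show that the proposed framework achieves significant performance improvements over existing benchmarks in objective metrics, subjective evaluations, and real-world A/B testing on the Huawei Music streaming platform.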

Summary (original)

Music similarity retrieval is fundamental for managing and exploring relevant content from large collections in streaming platforms. This paper presents a novel cross-modal contrastive learning framework that leverages the open-ended nature of text descriptions to guide music similarity modeling, addressing the limitations of traditional uni-modal approaches in capturing complex musical relationships. To overcome the scarcity of high-quality text-music paired data, this paper introduces a dual-source data acquisition approach combining online scraping and LLM-based prompting, where carefully designed prompts leverage LLMs’ comprehensive music knowledge to generate contextually rich descriptions. Extensive experiments demonstrate that the proposed framework achieves significant performance improvements over existing benchmarks through objective metrics, subjective evaluations, and real-world A/B testing on the Huawei Music streaming platform.
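
The abstract does not say which contrastive objective the framework uses. As a rough illustration of what cross-modal contrastive learning between music and text usually involves, the sketch below implements a generic symmetric InfoNCE (CLIP-style) loss over a batch of paired audio and text embeddings. This is an assumption made for illustration, not the authors' method. The function names, the temperature value of 0.07, and the toy embeddings are all hypothetical.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def cross_entropy_diagonal(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_sum_exp = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)


def symmetric_contrastive_loss(audio_embs, text_embs, temperature=0.07):
    """Generic symmetric InfoNCE loss (assumed, not taken from the paper).

    audio_embs[i] and text_embs[i] are treated as a matching pair;
    every other combination in the batch serves as a negative.
    """
    audio = [l2_normalize(a) for a in audio_embs]
    text = [l2_normalize(t) for t in text_embs]
    # Pairwise cosine similarities scaled by the temperature.
    logits = [
        [sum(x * y for x, y in zip(a, t)) / temperature for t in text]
        for a in audio
    ]
    logits_transposed = [list(col) for col in zip(*logits)]
    audio_to_text = cross_entropy_diagonal(logits)
    text_to_audio = cross_entropy_diagonal(logits_transposed)
    return 0.5 * (audio_to_text + text_to_audio)


if __name__ == "__main__":
    # Toy 2-D embeddings: each track lines up with its own description.
    audio = [[1.0, 0.0], [0.0, 1.0]]
    text = [[1.0, 0.0], [0.0, 1.0]]
    print("aligned pairs:", symmetric_contrastive_loss(audio, text))
    print("swapped pairs:", symmetric_contrastive_loss(audio, text[::-1]))
```

Minimizing a loss of this kind pulls each track's embedding toward its own description and away from the other descriptions in the batch, so the aligned toy batch scores a much lower loss than the swapped one. Once trained, such embeddings can be compared with cosine similarity to retrieve similar tracks.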

arXiv information

Authors	Tristan Tsoi,Jiajun Deng,Yaolong Ju,Benno Weck,Holger Kirchhoff,Simon Lui
Published	2025-05-23 15:34:02+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
