Kimi-Audio Technical Report

Abstract

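We present Kimi-Audio, an open-source audio foundation model that excels at audio understanding, generation, and conversation.
We describe in detail the practices used to build Kimi-Audio, including model architecture, data curation, training recipe, inference deployment, and evaluation.
Specifically, we use a 12.5Hz audio tokenizer, design a novel LLM-based architecture that takes continuous features as input and produces discrete tokens as output, and develop a chunk-wise streaming detokenizer based on flow matching.
We curate a pre-training dataset of more than 13 million hours of audio covering a wide range of modalities, including speech, sound, and music, and build a pipeline for constructing high-quality, diverse post-training data.
Initialized from a pre-trained LLM, Kimi-Audio is continually pre-trained on both audio and text data with several carefully designed tasks, and then fine-tuned to support a diverse range of audio-related tasks.
Extensive evaluation shows that Kimi-Audio achieves state-of-the-art performance on a range of audio benchmarks, including speech recognition, audio understanding, audio question answering, and speech conversation.
We release the code, model checkpoints, and evaluation toolkits at https://github.com/moonshotai/kimi-audio.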
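
To give a sense of scale for the 12.5Hz tokenizer rate, the following is a minimal back-of-the-envelope sketch. It assumes exactly one discrete token per 12.5Hz frame, which the abstract does not state explicitly; the function name `tokens_for_duration` is hypothetical and is not part of the released code.

```python
# Rough token-count arithmetic for a 12.5 Hz audio tokenizer.
# Assumption (not stated in the abstract): one discrete token per frame.

FRAME_RATE_HZ = 12.5  # tokenizer rate given in the abstract


def tokens_for_duration(seconds: float, rate_hz: float = FRAME_RATE_HZ) -> int:
    """Approximate number of audio tokens for a clip of the given length."""
    return int(seconds * rate_hz)


# One minute of audio.
clip_tokens = tokens_for_duration(60.0)

# Lower bound for the pre-training corpus: 13 million hours, in seconds.
corpus_seconds = 13_000_000 * 3600
corpus_tokens = tokens_for_duration(corpus_seconds)

print(f"60 s clip: {clip_tokens} tokens")
print(f"13M hours: {corpus_tokens:,} tokens")
```

Under this assumption, a one-minute clip maps to 750 tokens, and 13 million hours of audio corresponds to roughly 585 billion tokens.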

Abstract (original)

We present Kimi-Audio, an open-source audio foundation model that excels in audio understanding, generation, and conversation. We detail the practices in building Kimi-Audio, including model architecture, data curation, training recipe, inference deployment, and evaluation. Specifically, we leverage a 12.5Hz audio tokenizer, design a novel LLM-based architecture with continuous features as input and discrete tokens as output, and develop a chunk-wise streaming detokenizer based on flow matching. We curate a pre-training dataset that consists of more than 13 million hours of audio data covering a wide range of modalities including speech, sound, and music, and build a pipeline to construct high-quality and diverse post-training data. Initialized from a pre-trained LLM, Kimi-Audio is continual pre-trained on both audio and text data with several carefully designed tasks, and then fine-tuned to support a diverse of audio-related tasks. Extensive evaluation shows that Kimi-Audio achieves state-of-the-art performance on a range of audio benchmarks including speech recognition, audio understanding, audio question answering, and speech conversation. We release the codes, model checkpoints, as well as the evaluation toolkits in https://github.com/MoonshotAI/Kimi-Audio.

arXiv information

Authors	KimiTeam, Ding Ding, Zeqian Ju, Yichong Leng, Songxiang Liu, Tong Liu, Zeyu Shang, Kai Shen, Wei Song, Xu Tan, Heyi Tang, Zhengtao Wang, Chu Wei, Yifei Xin, Xinran Xu, Jianwei Yu, Yutao Zhang, Xinyu Zhou, Y. Charles, Jun Chen, Yanru Chen, Yulun Du, Weiran He, Zhenxing Hu, Guokun Lai, Qingcheng Li, Yangyang Liu, Weidong Sun, Jianzhou Wang, Yuzhi Wang, Yuefeng Wu, Yuxin Wu, Dongchao Yang, Hao Yang, Ying Yang, Zhilin Yang, Aoxiong Yin, Ruibin Yuan, Yutong Zhang, Zaida Zhou
Published	2025-04-25 15:31:46+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
