Step-Audio-AQAA: a Fully End-to-End Expressive Large Audio Language Model

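Abstract

Large audio-language models (LALMs) have significantly advanced intelligent human-computer interaction, but their reliance on text-based output limits their ability to generate natural speech responses directly and hinders seamless audio interaction.
To address this, we introduce Step-Audio-AQAA, a fully end-to-end LALM designed for Audio Query-Audio Answer (AQAA) tasks.
The model integrates a dual-codebook audio tokenizer for linguistic and semantic feature extraction, a 130-billion-parameter backbone LLM, and a neural vocoder for high-fidelity speech synthesis.
The post-training approach uses interleaved text and audio token output to strengthen semantic coherence, and combines Direct Preference Optimization (DPO) with model merging to improve performance.
Evaluation on the StepEval-Audio-360 benchmark shows that Step-Audio-AQAA is particularly strong at speech control and outperforms state-of-the-art LALMs in key areas.
This work contributes a promising solution for end-to-end LALMs and highlights the important role of the token-based vocoder in improving overall performance on AQAA tasks.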
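
For readers unfamiliar with DPO, the following is a minimal sketch of the standard DPO loss for a single preference pair. It is not taken from the paper: the abstract does not say which DPO variant or hyperparameters were used, so the function name `dpo_loss`, the argument names, and the default `beta=0.1` are all illustrative assumptions. Inputs are sequence-level log-probabilities of a preferred ("chosen") and a dispreferred ("rejected") response under the trained policy and a frozen reference model.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one preference pair.

    All names and the beta default are hypothetical; the source abstract
    gives no implementation details.
    """
    # How much more the policy likes each response than the reference does.
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    logits = beta * (chosen_margin - rejected_margin)
    # -log(sigmoid(logits)), computed stably as softplus(-logits).
    if logits >= 0:
        return math.log1p(math.exp(-logits))
    return -logits + math.log1p(math.exp(logits))
```

When the policy and the reference agree, the loss equals log 2. It falls below that value as the policy shifts probability toward the chosen response relative to the reference, and rises above it in the opposite case.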

Abstract (original)

Large Audio-Language Models (LALMs) have significantly advanced intelligent human-computer interaction, yet their reliance on text-based outputs limits their ability to generate natural speech responses directly, hindering seamless audio interactions. To address this, we introduce Step-Audio-AQAA, a fully end-to-end LALM designed for Audio Query-Audio Answer (AQAA) tasks. The model integrates a dual-codebook audio tokenizer for linguistic and semantic feature extraction, a 130-billion-parameter backbone LLM and a neural vocoder for high-fidelity speech synthesis. Our post-training approach employs interleaved token-output of text and audio to enhance semantic coherence and combines Direct Preference Optimization (DPO) with model merge to improve performance. Evaluations on the StepEval-Audio-360 benchmark demonstrate that Step-Audio-AQAA excels especially in speech control, outperforming the state-of-art LALMs in key areas. This work contributes a promising solution for end-to-end LALMs and highlights the critical role of token-based vocoder in enhancing overall performance for AQAA tasks.

arXiv information

Authors	Ailin Huang,Bingxin Li,Bruce Wang,Boyong Wu,Chao Yan,Chengli Feng,Heng Wang,Hongyu Zhou,Hongyuan Wang,Jingbei Li,Jianjian Sun,Joanna Wang,Mingrui Chen,Peng Liu,Ruihang Miao,Shilei Jiang,Tian Fei,Wang You,Xi Chen,Xuerui Yang,Yechang Huang,Yuxiang Zhang,Zheng Ge,Zheng Gong,Zhewei Huang,Zixin Zhang,Bin Wang,Bo Li,Buyun Ma,Changxin Miao,Changyi Wan,Chen Xu,Dapeng Shi,Dingyuan Hu,Enle Liu,Guanzhe Huang,Gulin Yan,Hanpeng Hu,Haonan Jia,Jiahao Gong,Jiaoren Wu,Jie Wu,Jie Yang,Junzhe Lin,Kaixiang Li,Lei Xia,Longlong Gu,Ming Li,Nie Hao,Ranchen Ming,Shaoliang Pang,Siqi Liu,Song Yuan,Tiancheng Cao,Wen Li,Wenqing He,Xu Zhao,Xuelin Zhang,Yanbo Yu,Yinmin Zhong,Yu Zhou,Yuanwei Liang,Yuanwei Lu,Yuxiang Yang,Zidong Yang,Zili Zhang,Binxing Jiao,Heung-Yeung Shum,Jiansheng Chen,Jing Li,Xiangyu Zhang,Xinhao Zhang,Yibo Zhu,Daxin Jiang,Shuchang Zhou,Chen Hu
Published	2025-06-10 16:37:39+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
