Reinforcement Learning Outperforms Supervised Fine-Tuning: A Case Study on Audio Question Answering

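Summary

Recently, reinforcement learning (RL) has been shown to substantially enhance the reasoning capabilities of large language models (LLMs), and RL-based approaches are gradually being applied to visual multimodal tasks.
However, the audio modality has been largely overlooked in these developments.
We therefore conduct a series of RL explorations in audio understanding and reasoning, focusing specifically on the audio question answering (AQA) task.
We apply the group relative policy optimization (GRPO) algorithm to Qwen2-Audio-7B-Instruct, and our experiments demonstrate state-of-the-art performance on the MMAU Test-mini benchmark, reaching an accuracy of 64.5%.
The main findings of this technical report are as follows. 1) The GRPO algorithm can be applied effectively to large audio language models (LALMs), even when the model has only 8.2B parameters.
2) With only 38k post-training samples, RL significantly outperforms supervised fine-tuning (SFT), indicating that RL-based approaches can be effective without large datasets.
3) An explicit reasoning process has not shown significant benefits for AQA tasks, and how to use deep thinking efficiently remains an open question for further research.
4) LALMs still lag far behind humans in auditory-language reasoning, suggesting that RL-based approaches warrant further exploration.
The project is available at https://github.com/xiaomi-research/r1-aqa and https://huggingface.co/mispeech/r1-aqa.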

Abstract (Original)

Recently, reinforcement learning (RL) has been shown to greatly enhance the reasoning capabilities of large language models (LLMs), and RL-based approaches have been progressively applied to visual multimodal tasks. However, the audio modality has largely been overlooked in these developments. Thus, we conduct a series of RL explorations in audio understanding and reasoning, specifically focusing on the audio question answering (AQA) task. We apply the group relative policy optimization (GRPO) algorithm to Qwen2-Audio-7B-Instruct, and our experiments demonstrate state-of-the-art performance on the MMAU Test-mini benchmark, achieving an accuracy rate of 64.5%. The main findings in this technical report are as follows: 1) The GRPO algorithm can be effectively applied to large audio language models (LALMs), even when the model has only 8.2B parameters; 2) With only 38k post-training samples, RL significantly outperforms supervised fine-tuning (SFT), indicating that RL-based approaches can be effective without large datasets; 3) The explicit reasoning process has not shown significant benefits for AQA tasks, and how to efficiently utilize deep thinking remains an open question for further research; 4) LALMs still lag far behind humans in auditory-language reasoning, suggesting that the RL-based approaches warrant further exploration. Our project is available at https://github.com/xiaomi-research/r1-aqa and https://huggingface.co/mispeech/r1-aqa.
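
The abstract names GRPO without describing it. As a rough illustration only, the core idea of GRPO is to score each sampled response relative to the other responses drawn for the same question, rather than against a learned value function. The sketch below shows that group-relative normalization in plain Python. The function name, the 0/1 reward values, the use of the population standard deviation, and the `eps` constant are all assumptions made for this example; none of them are taken from the r1-aqa code.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of responses to the same question.

    Each advantage is (reward - group mean) / (group std + eps).
    Using the population std and this eps value is an assumption of the sketch.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four sampled answers to one audio question:
# 1.0 if the answer matches the reference choice, 0.0 otherwise.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
print(advantages)
```

Correct answers receive a positive advantage and incorrect ones a negative advantage, and a group in which every answer earns the same reward yields all-zero advantages, so it contributes no learning signal.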

arXiv information

Authors	Gang Li,Jizhong Liu,Heinrich Dinkel,Yadong Niu,Junbo Zhang,Jian Luan
Published	2025-03-19 16:33:16+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
