Co-Evolving LLM Coder and Unit Tester via Reinforcement Learning

Summary

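This framework co-evolves coding and unit test generation capabilities based on the outcomes of their interaction. The approach enables flexible and scalable training and lets the unit tester learn directly from the coder's mistakes. After optimization on Qwen2.5-Instruct models, ReasonFlux-Coder-7B and 14B improve code generation accuracy by 5.3% and Best-of-N accuracy by 9.0%, outperforming similarly sized Qwen-Coder, DeepSeek-Coder, and Seed-Coder. They also extend naturally to downstream tasks such as test-time scaling and agentic coding, achieving an 8.1% improvement over the base model. For the long-CoT model, ReasonFlux-Coder-4B consistently outperforms Qwen3-4B while achieving 64.8% inference efficiency in unit test generation. The model can also serve as an effective reward model for reinforcement learning on base models. Project: https://github.com/Gen-Verse/CURE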

Summary (original)

We propose CURE, a novel reinforcement learning framework with a dedicated reward design that co-evolves coding and unit test generation capabilities based on their interaction outcomes, without any ground-truth code as supervision. This approach enables flexible and scalable training and allows the unit tester to learn directly from the coder’s mistakes. Our derived ReasonFlux-Coder-7B and 14B models improve code generation accuracy by 5.3% and Best-of-N accuracy by 9.0% after optimization on Qwen2.5-Instruct models, outperforming similarly sized Qwen-Coder, DeepSeek-Coder, and Seed-Coder. They naturally extend to downstream tasks such as test-time scaling and agentic coding, achieving an 8.1% improvement over the base model. For the long-CoT model, our ReasonFlux-Coder-4B consistently outperforms Qwen3-4B while achieving 64.8% inference efficiency in unit test generation. Notably, we also find that our model can serve as an effective reward model for reinforcement learning on base models. Project: https://github.com/Gen-Verse/CURE
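
To make the Best-of-N setting mentioned in the abstract concrete, here is a minimal sketch of how generated unit tests can be used to pick one of N candidate solutions. It is an illustration under stated assumptions, not the CURE implementation: the names `run_test`, `pass_matrix`, and `best_of_n` are hypothetical, candidates are modeled as plain Python callables rather than model-generated programs, and the selection rule (choose the candidate that passes the most tests) is a common heuristic that the abstract does not spell out.

```python
from typing import Callable, List, Sequence, Tuple

# Assumption: a candidate is a callable and a unit test is an (input, expected) pair.
Candidate = Callable[[int], int]
UnitTest = Tuple[int, int]


def run_test(candidate: Candidate, test: UnitTest) -> bool:
    """Return True if the candidate produces the expected output for this test."""
    arg, expected = test
    try:
        return candidate(arg) == expected
    except Exception:
        # A crashing candidate simply fails the test.
        return False


def pass_matrix(candidates: Sequence[Candidate],
                tests: Sequence[UnitTest]) -> List[List[bool]]:
    """Execute every candidate on every test; rows are candidates, columns are tests."""
    return [[run_test(c, t) for t in tests] for c in candidates]


def best_of_n(candidates: Sequence[Candidate], tests: Sequence[UnitTest]) -> int:
    """Return the index of the candidate that passes the most tests (first one on ties)."""
    scores = [sum(row) for row in pass_matrix(candidates, tests)]
    return max(range(len(candidates)), key=scores.__getitem__)


# Hypothetical task: "return the square of x".
candidates = [
    lambda x: x * 2,   # wrong, but agrees with some tests by coincidence
    lambda x: x * x,   # correct
    lambda x: x // 0,  # always raises ZeroDivisionError
]
tests = [(2, 4), (3, 9), (0, 0)]

chosen = best_of_n(candidates, tests)
```

With these toy inputs the second candidate is selected because it is the only one that passes all three tests. The same pass/fail matrix is the kind of interaction outcome between coder and unit tester that a reward could be derived from, though the actual reward design is described only in the paper itself.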

arXiv information

Authors	Yinjie Wang, Ling Yang, Ye Tian, Ke Shen, Mengdi Wang
Published	2025-06-03 17:58:42+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, DeepL
