Reflect, Retry, Reward: Self-Improving LLMs via Reinforcement Learning

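Summary

We explore a method for improving the performance of large language models through self-reflection and reinforcement learning.
By incentivizing the model to generate better self-reflections when it answers incorrectly, we show that its ability to solve complex, verifiable tasks can be strengthened even when generating synthetic data is infeasible and only binary feedback is available.
The framework operates in two stages. First, when the model fails a given task, it generates a self-reflective commentary analyzing its previous attempt.
Second, the model makes another attempt at the task with that self-reflection in context.
If the subsequent attempt succeeds, the tokens generated during the self-reflection stage are rewarded.
Experimental results show substantial performance gains across a variety of model architectures, with improvements of up to 34.7% on math equation writing and up to 18.1% on function calling.
Notably, small fine-tuned models (1.5 billion to 7 billion parameters) outperform models from the same family that are 10 times larger.
This novel paradigm is therefore an exciting path toward more useful and reliable language models that can self-improve on challenging tasks with limited external feedback.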

Abstract (original)

We explore a method for improving the performance of large language models through self-reflection and reinforcement learning. By incentivizing the model to generate better self-reflections when it answers incorrectly, we demonstrate that a model’s ability to solve complex, verifiable tasks can be enhanced even when generating synthetic data is infeasible and only binary feedback is available. Our framework operates in two stages: first, upon failing a given task, the model generates a self-reflective commentary analyzing its previous attempt; second, the model is given another attempt at the task with the self-reflection in context. If the subsequent attempt succeeds, the tokens generated during the self-reflection phase are rewarded. Our experimental results show substantial performance gains across a variety of model architectures, as high as 34.7% improvement at math equation writing and 18.1% improvement at function calling. Notably, smaller fine-tuned models (1.5 billion to 7 billion parameters) outperform models in the same family that are 10 times larger. Our novel paradigm is thus an exciting pathway to more useful and reliable language models that can self-improve on challenging tasks with limited external feedback.
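
The two-stage loop described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the `Model` and `Verifier` callables, the prompt wording, the `Episode` record, and the 1.0/0.0 reward values are all hypothetical names and choices introduced here. The reinforcement learning update that would consume the reward is not shown.

```python
from dataclasses import dataclass
from typing import Callable, Optional

# Hypothetical interfaces (not from the paper): a model maps a prompt string to a
# completion string, and a verifier returns binary feedback for a task/answer pair.
Model = Callable[[str], str]
Verifier = Callable[[str, str], bool]


@dataclass
class Episode:
    first_attempt: str
    reflection: Optional[str]      # None when the first attempt already succeeded
    second_attempt: Optional[str]  # None when no retry was needed
    reflection_reward: float       # reward credited to the reflection tokens only


def reflect_retry_reward(task: str, model: Model, verify: Verifier) -> Episode:
    first = model(task)
    if verify(task, first):
        # No failure, so no reflection is generated and nothing is rewarded.
        return Episode(first, None, None, 0.0)

    # Stage 1: the model comments on its own failed attempt.
    reflection = model(
        f"{task}\n\nPrevious attempt:\n{first}\n\n"
        "The attempt was incorrect. Reflect on what went wrong."
    )
    # Stage 2: the model retries with the self-reflection in context.
    second = model(f"{task}\n\nReflection:\n{reflection}\n\nTry again.")

    # Binary reward, credited to the reflection rather than to the retry itself.
    reward = 1.0 if verify(task, second) else 0.0
    return Episode(first, reflection, second, reward)


# Toy stand-ins so the sketch runs without a real language model.
def toy_model(prompt: str) -> str:
    if "Reflect on what went wrong" in prompt:
        return "I added instead of multiplying."
    if "Reflection:" in prompt:
        return "12"
    return "7"


def toy_verify(task: str, answer: str) -> bool:
    return answer == "12"


episode = reflect_retry_reward("What is 3 * 4?", toy_model, toy_verify)
print(episode.reflection_reward)  # 1.0
```

In this sketch the reward is attached to the reflection text alone, which mirrors the abstract's statement that the tokens generated during the self-reflection phase are the ones rewarded when the retry succeeds.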

arXiv information

Authors	Shelly Bensal, Umar Jamil, Christopher Bryant, Melisa Russak, Kiran Kamble, Dmytro Mozolevskyi, Muayad Ali, Waseem AlShikh
Published	2025-05-30 15:49:42+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
