L1: Controlling How Long A Reasoning Model Thinks With Reinforcement Learning

Summary

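Reasoning language models show an uncanny ability to improve test-time performance by "thinking longer", that is, by generating longer chain-of-thought sequences and therefore using more compute.
However, the length of their chain-of-thought reasoning cannot be controlled, so test-time compute cannot be allocated to reach a desired level of performance.
We introduce Length Controlled Policy Optimization (LCPO), a simple reinforcement learning method that optimizes for both accuracy and adherence to user-specified length constraints.
We use LCPO to train L1, a reasoning language model that produces outputs satisfying the length constraint given in its prompt.
L1's length control allows computational cost and accuracy to be traded off smoothly across a wide range of tasks, and it outperforms the state-of-the-art S1 method for length control.
Furthermore, we uncover an unexpected short chain-of-thought capability in models trained with LCPO.
For instance, the 1.5B L1 model surpasses GPT-4o at equal reasoning lengths.
Overall, LCPO enables precise control over reasoning length, allowing fine-grained allocation of test-time compute and accuracy.
We release code and models at https://www.cmu-l3.github.io/l1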

Summary (original)

Reasoning language models have shown an uncanny ability to improve performance at test-time by “thinking longer”, that is, by generating longer chain-of-thought sequences and hence using more compute. However, the length of their chain-of-thought reasoning is not controllable, making it impossible to allocate test-time compute to achieve a desired level of performance. We introduce Length Controlled Policy Optimization (LCPO), a simple reinforcement learning method that optimizes for accuracy and adherence to user-specified length constraints. We use LCPO to train L1, a reasoning language model that produces outputs satisfying a length constraint given in its prompt. L1’s length control allows for smoothly trading off computational cost and accuracy on a wide range of tasks, and outperforms the state-of-the-art S1 method for length control. Furthermore, we uncover an unexpected short chain-of-thought capability in models trained with LCPO. For instance, our 1.5B L1 model surpasses GPT-4o at equal reasoning lengths. Overall, LCPO enables precise control over reasoning length, allowing for fine-grained allocation of test-time compute and accuracy. We release code and models at https://www.cmu-l3.github.io/l1
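
The abstract says LCPO optimizes for accuracy and for adherence to a length constraint given in the prompt, but it does not state the reward itself. The sketch below is one plausible reading, not the authors' implementation: it assumes a reward equal to a correctness indicator minus a penalty proportional to the gap between the requested and generated token counts. The function names, the prompt wording, and the weight `alpha` are all hypothetical, chosen only for illustration.

```python
def build_prompt(question: str, target_tokens: int) -> str:
    """Attach a length instruction to the question (wording is an assumption)."""
    return f"{question}\n\nThink for {target_tokens} tokens."


def length_controlled_reward(
    is_correct: bool,
    target_tokens: int,
    actual_tokens: int,
    alpha: float = 0.001,  # assumed penalty weight, not taken from the source
) -> float:
    """Correctness reward minus a penalty proportional to the length deviation."""
    correctness = 1.0 if is_correct else 0.0
    length_penalty = alpha * abs(target_tokens - actual_tokens)
    return correctness - length_penalty


if __name__ == "__main__":
    print(build_prompt("What is 17 * 24?", target_tokens=512))
    # A correct answer that overshoots the requested length by 88 tokens.
    print(length_controlled_reward(True, target_tokens=512, actual_tokens=600))
    # An incorrect answer that matches the requested length exactly.
    print(length_controlled_reward(False, target_tokens=512, actual_tokens=512))
```

Under these assumptions, a larger `alpha` pushes the policy toward hitting the requested length more exactly, at the cost of weighting correctness relatively less; a reinforcement learning loop would use this scalar as the per-sample reward.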

arXiv information

Authors	Pranjal Aggarwal, Sean Welleck
Published	2025-03-06 18:43:29+00:00
arXiv page	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
