TTRL: Test-Time Reinforcement Learning

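Summary

This paper investigates reinforcement learning (RL) on data without explicit labels for reasoning tasks in large language models (LLMs).
The core challenge is estimating rewards during inference without access to ground-truth information.
Although this setting appears elusive, common test-time scaling (TTS) practices such as majority voting turn out to yield surprisingly effective rewards that are suitable for driving RL training.
This work introduces Test-Time Reinforcement Learning (TTRL), a novel method for training LLMs with RL on unlabeled data.
TTRL enables self-evolution of LLMs by exploiting the priors in pre-trained models.
Experiments show that TTRL consistently improves performance across a variety of tasks and models.
Notably, TTRL boosts the pass@1 performance of Qwen-2.5-Math-7B by approximately 159% on AIME 2024 using only unlabeled test data.
Furthermore, although TTRL is supervised only by the Maj@N metric, it consistently surpasses the upper limit of the initial model and approaches the performance of models trained directly on the test data with ground-truth labels.
The experimental findings validate the general effectiveness of TTRL across tasks and highlight its potential for broader tasks and domains.
GitHub: https://github.com/PRIME-RL/TTRL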

Summary (original)

This paper investigates Reinforcement Learning (RL) on data without explicit labels for reasoning tasks in Large Language Models (LLMs). The core challenge of the problem is reward estimation during inference while not having access to ground-truth information. While this setting appears elusive, we find that common practices in Test-Time Scaling (TTS), such as majority voting, yield surprisingly effective rewards suitable for driving RL training. In this work, we introduce Test-Time Reinforcement Learning (TTRL), a novel method for training LLMs using RL on unlabeled data. TTRL enables self-evolution of LLMs by utilizing the priors in the pre-trained models. Our experiments demonstrate that TTRL consistently improves performance across a variety of tasks and models. Notably, TTRL boosts the pass@1 performance of Qwen-2.5-Math-7B by approximately 159% on the AIME 2024 with only unlabeled test data. Furthermore, although TTRL is only supervised by the Maj@N metric, TTRL has demonstrated performance to consistently surpass the upper limit of the initial model, and approach the performance of models trained directly on test data with ground-truth labels. Our experimental findings validate the general effectiveness of TTRL across various tasks, and highlight TTRL’s potential for broader tasks and domains. GitHub: https://github.com/PRIME-RL/TTRL
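
The abstract states that majority voting over sampled outputs can serve as a reward signal when no ground-truth labels are available. The following is a minimal sketch of that idea, not the authors' implementation: the function name `majority_vote_rewards`, the exact-match comparison of answer strings, and the binary 0/1 reward are all assumptions made for illustration.

```python
from collections import Counter
from typing import List, Sequence, Tuple


def majority_vote_rewards(answers: Sequence[str]) -> Tuple[str, List[float]]:
    """Return the majority answer and a 0/1 reward for each sampled answer.

    Hypothetical helper: the majority answer acts as a pseudo-label, and a
    sample is rewarded only if its answer matches that pseudo-label exactly.
    """
    if not answers:
        raise ValueError("need at least one sampled answer")
    counts = Counter(answers)
    # most_common(1) returns the most frequent answer; on a tie, the answer
    # that was encountered first wins.
    pseudo_label, _ = counts.most_common(1)[0]
    rewards = [1.0 if answer == pseudo_label else 0.0 for answer in answers]
    return pseudo_label, rewards


# Five hypothetical final answers sampled from a model for one unlabeled question.
samples = ["42", "41", "42", "7", "42"]
label, rewards = majority_vote_rewards(samples)
print(label, rewards)
```

In an RL loop, these per-sample rewards would stand in for rewards computed against a ground-truth label. How the answers are extracted from model outputs and how the rewards feed the policy update are outside the scope of this sketch.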

arXiv information

Authors	Yuxin Zuo, Kaiyan Zhang, Shang Qu, Li Sheng, Xuekai Zhu, Biqing Qi, Youbang Sun, Ganqu Cui, Ning Ding, Bowen Zhou
Published	2025-04-22 17:59:56+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
