Scaling Test-Time Compute Without Verification or RL is Suboptimal

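Summary

Despite substantial progress in scaling test-time compute, the community continues to debate how it should be scaled so that improvements keep coming efficiently.
There are two main approaches. The first distills successful search or thinking traces.
The second uses verification (0/1 outcome rewards, reward models, or verifiers) to guide reinforcement learning (RL) and search algorithms.
This paper proves that, for a fixed compute/data budget, finetuning LLMs with verifier-based (VB) methods built on RL or search is far superior to verifier-free (VF) approaches built on distilling or cloning search traces.
Further, it shows that as test-time compute (measured as output token length) and training data are scaled, the suboptimality of VF methods scales poorly relative to VB when the base pre-trained LLM presents a heterogeneous distribution over correct solution traces (for example, different lengths and styles) and admits a non-sharp distribution over the rewards of traces sampled from it.
The authors formalize this condition using anti-concentration [Erdős, 1945].
This implies the stronger result that VB methods scale better asymptotically, with the performance gap between VB and VF methods widening as the test-time budget grows.
The theory is corroborated empirically on both didactic and math reasoning problems with 3B, 8B, and 32B pre-trained LLMs, where verification proves crucial for scaling test-time compute.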

Abstract (original)

Despite substantial advances in scaling test-time compute, an ongoing debate in the community is how it should be scaled up to enable continued and efficient improvements with scaling. There are largely two approaches: first, distilling successful search or thinking traces; and second, using verification (e.g., 0/1 outcome rewards, reward models, or verifiers) to guide reinforcement learning (RL) and search algorithms. In this paper, we prove that finetuning LLMs with verifier-based (VB) methods based on RL or search is far superior to verifier-free (VF) approaches based on distilling or cloning search traces, given a fixed amount of compute/data budget. Further, we show that as we scale test-time compute (measured as the output token length) and training data, suboptimality of VF methods scales poorly compared to VB when the base pre-trained LLM presents a heterogeneous distribution over correct solution traces (e.g., different lengths, styles, etc.) and admits a non-sharp distribution over rewards on traces sampled from it. We formalize this condition using anti-concentration [Erdős, 1945]. This implies a stronger result that VB methods scale better asymptotically, with the performance gap between VB and VF methods widening as test-time budget grows. We corroborate our theory empirically on both didactic and math reasoning problems with 3/8/32B-sized pre-trained LLMs, where we find verification is crucial for scaling test-time compute.
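
As a rough illustration of why a 0/1 outcome reward can turn extra test-time compute into accuracy, the toy sketch below compares drawing a single trace with drawing n traces and keeping any one that a verifier accepts. This is not the paper's method or experimental setup: the per-trace success probability `p`, the assumption of a perfect verifier, the independence of samples, and the use of sample count (rather than output token length) as the compute budget are all simplifying assumptions made here, and the function names are hypothetical.

```python
import random


def single_sample_success(p: float) -> float:
    """Success probability when one trace is drawn and returned unverified."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be in [0, 1]")
    return p


def best_of_n_success(p: float, n: int) -> float:
    """Success probability when n independent traces are drawn and a
    perfect 0/1 verifier picks a correct one if any exists."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be in [0, 1]")
    if n < 1:
        raise ValueError("n must be at least 1")
    # The procedure fails only if all n traces are incorrect.
    return 1.0 - (1.0 - p) ** n


def simulate_best_of_n(p: float, n: int, trials: int = 10_000, seed: int = 0) -> float:
    """Monte Carlo estimate of best_of_n_success using simulated 0/1 rewards."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(trials):
        # Each trace receives reward 1 with probability p, else 0.
        rewards = [1 if rng.random() < p else 0 for _ in range(n)]
        if max(rewards) == 1:
            wins += 1
    return wins / trials


if __name__ == "__main__":
    p = 0.2  # assumed per-trace success rate of the toy base policy
    for n in (1, 2, 4, 8, 16):
        print(n, round(best_of_n_success(p, n), 4), simulate_best_of_n(p, n))
```

In this toy model the unverified single sample stays at `p` regardless of budget, while the verified selection improves with every additional sample. The paper's claim is a more general statement about finetuning under a fixed compute/data budget, so the sketch should be read only as intuition.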

arXiv information

Authors	Amrith Setlur,Nived Rajaraman,Sergey Levine,Aviral Kumar
Published	2025-02-18 18:54:12+00:00
arXiv page	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google
