Can 1B LLM Surpass 405B LLM? Rethinking Compute-Optimal Test-Time Scaling

Abstract

Test-Time Scaling (TTS) is an important method for improving the performance of Large Language Models (LLMs) by using additional computation during the inference phase. However, current studies do not systematically analyze how policy models, Process Reward Models (PRMs), and problem difficulty influence TTS. This lack of analysis limits the understanding and practical use of TTS methods. In this paper, we focus on two core questions: (1) What is the optimal approach to scale test-time computation across different policy models, PRMs, and problem difficulty levels? (2) To what extent can extended computation improve the performance of LLMs on complex tasks, and can smaller language models outperform larger ones through this approach? Through comprehensive experiments on MATH-500 and the challenging AIME24 tasks, we make the following observations: (1) The compute-optimal TTS strategy is highly dependent on the choice of policy model, PRM, and problem difficulty. (2) With our compute-optimal TTS strategy, extremely small policy models can outperform larger models. For example, a 1B LLM can exceed a 405B LLM on MATH-500. Moreover, on both MATH-500 and AIME24, a 0.5B LLM outperforms GPT-4o, a 3B LLM surpasses a 405B LLM, and a 7B LLM beats o1 and DeepSeek-R1, while also achieving higher inference efficiency. These findings show the significance of adapting TTS strategies to the specific characteristics of each task and model, and indicate that TTS is a promising approach for enhancing the reasoning abilities of LLMs.

arXiv information

Authors	Runze Liu, Junqi Gao, Jian Zhao, Kaiyan Zhang, Xiu Li, Biqing Qi, Wanli Ouyang, Bowen Zhou
Published	2025-02-10 17:30:23+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
