First Finish Search: Efficient Test-Time Scaling in Large Language Models

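Summary

Test-time scaling (TTS), which dynamically allocates compute during inference, is a promising way to improve reasoning in large language models.
Existing TTS methods work well, but they often rely on long decoding paths or require many samples, which increases token usage and inference latency.
The authors observe that, for reasoning tasks, shorter traces are much more likely to be correct than longer ones.
Motivated by this, they introduce First Finish Search (FFS), a training-free parallel decoding strategy that launches $n$ independent samples and returns as soon as any one of them completes.
FFS is evaluated alongside simple decoding, beam search, majority voting, and budget forcing on four reasoning models (DeepSeek-R1, R1-Distill-Qwen-32B, QwQ-32B, and Phi-4-Reasoning-Plus) and four datasets (AIME24, AIME25-I, AIME25-II, and GPQA Diamond).
With DeepSeek-R1, FFS reaches $82.23\%$ accuracy on the AIME datasets, a $15\%$ improvement over DeepSeek-R1's standalone accuracy that nearly matches OpenAI's o4-mini.
The theoretical analysis explains why stopping at the shortest trace is likely to yield a correct answer and identifies the conditions under which early stopping may be suboptimal.
The simplicity of FFS shows that straightforward TTS strategies can perform remarkably well, pointing to untapped potential in simple inference-time approaches.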

Abstract (original)

Test-time scaling (TTS), which involves dynamic allocation of compute during inference, offers a promising way to improve reasoning in large language models. While existing TTS methods work well, they often rely on long decoding paths or require a large number of samples to be generated, increasing the token usage and inference latency. We observe the surprising fact that for reasoning tasks, shorter traces are much more likely to be correct than longer ones. Motivated by this, we introduce First Finish Search (FFS), a training-free parallel decoding strategy that launches $n$ independent samples and returns as soon as any one completes. We evaluate FFS alongside simple decoding, beam search, majority voting, and budget forcing on four reasoning models (DeepSeek-R1, R1-Distill-Qwen-32B, QwQ-32B and Phi-4-Reasoning-Plus) and across four datasets (AIME24, AIME25-I, AIME25-II and GPQA Diamond). With DeepSeek-R1, FFS achieves $82.23\%$ accuracy on the AIME datasets, a $15\%$ improvement over DeepSeek-R1’s standalone accuracy, nearly matching OpenAI’s o4-mini performance. Our theoretical analysis explains why stopping at the shortest trace is likely to yield a correct answer and identifies the conditions under which early stopping may be suboptimal. The elegance and simplicity of FFS demonstrate that straightforward TTS strategies can perform remarkably well, revealing the untapped potential of simple approaches at inference time.
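The core control flow described in the abstract (launch $n$ independent samples, return as soon as any one completes) can be illustrated with a small simulation. This is only a sketch under stated assumptions, not the authors' implementation: `decode_sample` is a hypothetical stand-in for a real model call, and sleeping in proportion to a given trace length imitates the time needed to emit that many tokens.

```python
import asyncio


async def decode_sample(sample_id: int, trace_length: int) -> tuple[int, int]:
    # Hypothetical stand-in for one stochastic decode: sleeping in proportion
    # to trace_length imitates a model emitting that many tokens.
    await asyncio.sleep(trace_length * 0.002)
    return sample_id, trace_length


async def first_finish_search(trace_lengths: list[int]) -> tuple[int, int]:
    # Launch one independent task per sample.
    tasks = [
        asyncio.create_task(decode_sample(i, length))
        for i, length in enumerate(trace_lengths)
    ]
    # Block only until the first sample completes.
    done, pending = await asyncio.wait(tasks, return_when=asyncio.FIRST_COMPLETED)
    # Stop the remaining samples so they consume no further compute.
    for task in pending:
        task.cancel()
    await asyncio.gather(*pending, return_exceptions=True)
    # If several samples finish in the same loop iteration, keep the shortest.
    return min((t.result() for t in done), key=lambda r: r[1])


if __name__ == "__main__":
    winner_id, winner_length = asyncio.run(first_finish_search([30, 5, 20, 12]))
    print(f"sample {winner_id} finished first with {winner_length} tokens")
```

In this toy setting the wall-clock cost is set by the shortest trace rather than the longest, which is the latency argument the abstract makes against methods that wait for every sample. Whether the first-finished trace is also the correct one is the empirical claim of the paper and is not modeled here.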

arXiv information

Authors	Aradhye Agarwal, Ayan Sengupta, Tanmoy Chakraborty
Published	2025-05-23 17:57:43+00:00
arXiv page	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
