s1: Simple test-time scaling

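Summary

Test-time scaling is a promising new approach to language modeling that spends additional compute at test time to improve performance.
OpenAI's o1 model recently demonstrated this capability but did not publicly share its methodology, which prompted many replication efforts.
We look for the simplest approach that achieves both test-time scaling and strong reasoning performance.
First, we curate s1K, a small dataset of 1,000 questions paired with reasoning traces, selected using three criteria that we validate through ablations: difficulty, diversity, and quality.
Second, we develop budget forcing, which controls test-time compute either by forcibly ending the model's thinking process or by lengthening it, appending "Wait" multiple times to the model's generation when it tries to stop.
This can lead the model to double-check its answer, often fixing incorrect reasoning steps.
After supervised finetuning of the Qwen2.5-32B-Instruct language model on s1K and equipping it with budget forcing, our model s1 exceeds o1-preview on competition math questions by up to 27% (MATH and AIME24).
Further, scaling s1 with budget forcing lets it extrapolate beyond its performance without test-time intervention, from 50% to 57% on AIME24.
Our model, data, and code are open-source at https://github.com/simplescaling/s1.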
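
The budget forcing idea described above can be illustrated with a minimal sketch. This is not the authors' implementation (see the linked repository for that). It assumes a hypothetical `step` callable that returns the model's next token given the context, a hypothetical end-of-thinking delimiter `</think>`, and two hypothetical parameters, `max_thinking_tokens` and `min_wait_count`. A toy `toy_step` function stands in for a real language model.

```python
from typing import Callable, List

# Hypothetical delimiter: the real end-of-thinking token depends on the model.
END_OF_THINKING = "</think>"
WAIT = "Wait"


def budget_forcing(
    step: Callable[[List[str]], str],
    prompt: List[str],
    max_thinking_tokens: int,
    min_wait_count: int,
) -> List[str]:
    """Generate thinking tokens under a compute budget.

    - Lengthening: the first `min_wait_count` times the model tries to stop,
      its end-of-thinking token is dropped and "Wait" is appended instead.
    - Forced termination: once `max_thinking_tokens` tokens exist, thinking
      is closed regardless of what the model would do next.
    """
    thinking: List[str] = []
    waits_used = 0
    while len(thinking) < max_thinking_tokens:
        token = step(prompt + thinking)
        if token == END_OF_THINKING:
            if waits_used < min_wait_count:
                thinking.append(WAIT)  # nudge the model to keep reasoning
                waits_used += 1
                continue
            break  # model is allowed to stop on its own
        thinking.append(token)
    # Close the thinking phase (natural stop or forced termination).
    return thinking + [END_OF_THINKING]


def toy_step(context: List[str]) -> str:
    """Toy stand-in for a model: emits one 'step' token, then tries to stop."""
    return "step" if context[-1] in ("Q", WAIT) else END_OF_THINKING


if __name__ == "__main__":
    print(budget_forcing(toy_step, ["Q"], max_thinking_tokens=10, min_wait_count=2))
```

In this sketch, raising `min_wait_count` extends the reasoning and lowering `max_thinking_tokens` cuts it short, so both directions of control use the same loop. With a real model, `step` would wrap a decoding call and the tokens would be token IDs rather than strings.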

Abstract (original)

Test-time scaling is a promising new approach to language modeling that uses extra test-time compute to improve performance. Recently, OpenAI’s o1 model showed this capability but did not publicly share its methodology, leading to many replication efforts. We seek the simplest approach to achieve test-time scaling and strong reasoning performance. First, we curate a small dataset s1K of 1,000 questions paired with reasoning traces relying on three criteria we validate through ablations: difficulty, diversity, and quality. Second, we develop budget forcing to control test-time compute by forcefully terminating the model’s thinking process or lengthening it by appending ‘Wait’ multiple times to the model’s generation when it tries to end. This can lead the model to double-check its answer, often fixing incorrect reasoning steps. After supervised finetuning the Qwen2.5-32B-Instruct language model on s1K and equipping it with budget forcing, our model s1 exceeds o1-preview on competition math questions by up to 27% (MATH and AIME24). Further, scaling s1 with budget forcing allows extrapolating beyond its performance without test-time intervention: from 50% to 57% on AIME24. Our model, data, and code are open-source at https://github.com/simplescaling/s1.

arXiv information

Authors	Niklas Muennighoff, Zitong Yang, Weijia Shi, Xiang Lisa Li, Li Fei-Fei, Hannaneh Hajishirzi, Luke Zettlemoyer, Percy Liang, Emmanuel Candès, Tatsunori Hashimoto
Published	2025-01-31 18:48:08+00:00
arXiv page	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
