Linguistic Generalizability of Test-Time Scaling in Mathematical Reasoning

Abstract

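Scaling pre-training compute has proven effective for achieving multilinguality, but does the same hold for test-time scaling?
In this work, we introduce MCLM, a multilingual math benchmark featuring competition-level problems in 55 languages.
We test three test-time scaling methods, Outcome Reward Modeling (ORM), Process Reward Modeling (PRM), and Budget Forcing (BF), on both Qwen2.5-1.5B Math and MR1-1.5B, a multilingual LLM we trained for extended reasoning.
Our experiments show that Qwen2.5-1.5B Math with ORM achieves a score of 35.8 on MCLM, while BF on MR1-1.5B attains 35.2.
Although "thinking LLMs" have recently garnered significant attention, we find that their performance is comparable to traditional scaling methods such as best-of-N once both are constrained to similar levels of inference FLOPs.
Moreover, while BF yields a 20-point improvement on English AIME, it provides only a 1.94-point average gain across other languages. This pattern is consistent across the other test-time scaling methods we studied, which suggests that test-time scaling may not generalize as effectively to multilingual tasks.
To foster further research, we release MCLM, MR1-1.5B, and the evaluation results.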
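
For readers unfamiliar with best-of-N selection under an outcome reward model, the following is a minimal sketch of the general idea, not the authors' implementation. The names `best_of_n`, `make_toy_generator`, and `toy_reward` are hypothetical stand-ins: in practice `generate` would sample from an LLM and `score` would call a trained reward model.

```python
from typing import Callable, List, Tuple


def best_of_n(
    problem: str,
    generate: Callable[[str], str],
    score: Callable[[str, str], float],
    n: int = 4,
) -> Tuple[str, float]:
    """Sample n candidate solutions and keep the one with the highest reward."""
    if n < 1:
        raise ValueError("n must be at least 1")
    candidates: List[str] = [generate(problem) for _ in range(n)]
    scored = [(score(problem, c), c) for c in candidates]
    # max() returns the first maximal element, so ties go to the earliest sample.
    best_score, best = max(scored, key=lambda pair: pair[0])
    return best, best_score


# --- Toy stand-ins (assumptions for illustration only) ---
def make_toy_generator(outputs: List[str]) -> Callable[[str], str]:
    """Return a fake 'model' that replays a fixed list of completions."""
    it = iter(outputs)
    return lambda problem: next(it)


def toy_reward(problem: str, candidate: str) -> float:
    """Fake outcome reward: 1.0 if the completion ends in '4', else 0.0."""
    return 1.0 if candidate.strip().endswith("4") else 0.0


gen = make_toy_generator(["2 + 2 = 5", "2 + 2 = 4", "2 + 2 = 22", "2 + 2 = 3"])
answer, reward = best_of_n("What is 2 + 2?", gen, toy_reward, n=4)
print(answer, reward)
```

Here `n` is the knob that trades inference compute for accuracy: each extra sample costs one more generation plus one more reward-model call.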

Abstract (original)

Scaling pre-training compute has proven effective for achieving multilinguality, but does the same hold for test-time scaling? In this work, we introduce MCLM, a multilingual math benchmark featuring competition-level problems in 55 languages. We test three test-time scaling methods, Outcome Reward Modeling (ORM), Process Reward Modeling (PRM), and Budget Forcing (BF), on both Qwen2.5-1.5B Math and MR1-1.5B, a multilingual LLM we trained for extended reasoning. Our experiments show that using Qwen2.5-1.5B Math with ORM achieves a score of 35.8 on MCLM, while BF on MR1-1.5B attains 35.2. Although ‘thinking LLMs’ have recently garnered significant attention, we find that their performance is comparable to traditional scaling methods like best-of-N once constrained to similar levels of inference FLOPs. Moreover, while BF yields a 20-point improvement on English AIME, it provides only a 1.94-point average gain across other languages, a pattern consistent across the other test-time scaling methods we studied, highlighting that test-time scaling may not generalize as effectively to multilingual tasks. To foster further research, we release MCLM, MR1-1.5B, and evaluation results.

arXiv information

Authors	Guijin Son, Jiwoo Hong, Hyunwoo Ko, James Thorne
Published	2025-02-24 18:36:15+00:00
arXiv page	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google
