A Sober Look at Progress in Language Model Reasoning: Pitfalls and Paths to Reproducibility

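Abstract

Reasoning has emerged as the next major frontier for language models (LMs), with rapid advances from both academic and industrial labs.
However, this progress often outpaces methodological rigor, and many evaluations rely on benchmarking practices that lack transparency, robustness, or statistical grounding.
In this work, we conduct a comprehensive empirical study and find that current mathematical reasoning benchmarks are highly sensitive to subtle implementation choices, including decoding parameters, random seeds, prompt formatting, and even hardware and software-framework configurations.
Performance gains reported in recent studies frequently hinge on unclear comparisons or unreported sources of variance.
To address these issues, we propose a standardized evaluation framework with clearly defined best practices and reporting standards.
Using this framework, we reassess recent methods and find that reinforcement learning (RL) approaches yield only modest improvements, far below prior claims, and are prone to overfitting, especially on small-scale benchmarks like AIME24.
In contrast, supervised finetuning (SFT) methods show consistently stronger generalization.
To foster reproducibility, we release all code, prompts, and model outputs for the reasoning benchmarks, establishing a more rigorous foundation for future work.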

Abstract (original)

Reasoning has emerged as the next major frontier for language models (LMs), with rapid advances from both academic and industrial labs. However, this progress often outpaces methodological rigor, with many evaluations relying on benchmarking practices that lack transparency, robustness, or statistical grounding. In this work, we conduct a comprehensive empirical study and find that current mathematical reasoning benchmarks are highly sensitive to subtle implementation choices – including decoding parameters, random seeds, prompt formatting, and even hardware and software-framework configurations. Performance gains reported in recent studies frequently hinge on unclear comparisons or unreported sources of variance. To address these issues, we propose a standardized evaluation framework with clearly defined best practices and reporting standards. Using this framework, we reassess recent methods and find that reinforcement learning (RL) approaches yield only modest improvements – far below prior claims – and are prone to overfitting, especially on small-scale benchmarks like AIME24. In contrast, supervised finetuning (SFT) methods show consistently stronger generalization. To foster reproducibility, we release all code, prompts, and model outputs, for reasoning benchmarks, establishing more rigorous foundations for future work.
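
To make the point about seed sensitivity on small benchmarks concrete, here is a minimal sketch of reporting accuracy as a mean and standard deviation over several seeds rather than as a single number. It is not the paper's evaluation framework. The function names and the per-seed counts are hypothetical and chosen only for illustration; the one fixed fact is that AIME24 has 30 problems, so a single question changes accuracy by about 3.3 percentage points.

```python
import math
import statistics


def accuracy_summary(per_seed_correct, n_questions):
    """Return (mean, sample std, min, max) of accuracy in percent across seeds."""
    accs = [100.0 * c / n_questions for c in per_seed_correct]
    mean = statistics.mean(accs)
    # Sample standard deviation needs at least two runs.
    std = statistics.stdev(accs) if len(accs) > 1 else 0.0
    return mean, std, min(accs), max(accs)


def binomial_standard_error(p, n_questions):
    """Standard error in percent of an accuracy p (in [0, 1]) on n questions."""
    return 100.0 * math.sqrt(p * (1.0 - p) / n_questions)


N_QUESTIONS = 30  # AIME24 has 30 problems

# Effect of a single question flipping from wrong to right.
per_question_step = 100.0 / N_QUESTIONS

# Hypothetical number of correct answers for five seeds (illustrative only,
# not measurements from the paper).
hypothetical_correct = [9, 11, 10, 8, 12]

mean, std, lowest, highest = accuracy_summary(hypothetical_correct, N_QUESTIONS)

print(f"one question = {per_question_step:.2f} points")
print(f"accuracy = {mean:.1f} +/- {std:.1f} (range {lowest:.1f} to {highest:.1f})")
print(f"binomial SE at p=1/3: {binomial_standard_error(1 / 3, N_QUESTIONS):.1f} points")
```

With only 30 questions, both the seed-to-seed spread and the sampling error in this illustration are several percentage points wide, so a claimed improvement of similar size from a single run cannot be told apart from noise.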

arXiv information

Authors	Andreas Hochlehnert,Hardik Bhatnagar,Vishaal Udandarao,Samuel Albanie,Ameya Prabhu,Matthias Bethge
Published	2025-04-09 17:58:17+00:00
arXiv page	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
