PhysReason: A Comprehensive Benchmark towards Physics-Based Reasoning

Abstract

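Large language models show remarkable capabilities across many domains, especially in mathematics and logical reasoning.
However, current evaluations overlook physics-based reasoning, a complex task that requires applying physics theorems and constraints.
We present PhysReason, a 1,200-problem benchmark comprising knowledge-based (25%) and reasoning-based (75%) problems; the latter are divided into three difficulty levels (easy, medium, hard).
Notably, problems require an average of 8.1 solution steps, and hard problems require 15.6, reflecting the complexity of physics-based reasoning.
We propose the Physics Solution Auto Scoring Framework, which combines efficient answer-level evaluation with comprehensive step-level evaluation.
Top-performing models such as DeepSeek-R1, Gemini-2.0-Flash-Thinking, and o3-mini-high score below 60% on answer-level evaluation, with performance dropping from 75.11% on knowledge questions to 31.95% on hard problems.
Through step-level evaluation, we identified four key bottlenecks: Physics Theorem Application, Physics Process Understanding, Calculation, and Physics Condition Analysis.
These findings position PhysReason as a novel and comprehensive benchmark for evaluating physics-based reasoning capabilities in large language models.
Our code and data will be published at https://dxzxy12138.github.io/PhysReason.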

Abstract (original)

Large language models demonstrate remarkable capabilities across various domains, especially mathematics and logic reasoning. However, current evaluations overlook physics-based reasoning – a complex task requiring physics theorems and constraints. We present PhysReason, a 1,200-problem benchmark comprising knowledge-based (25%) and reasoning-based (75%) problems, where the latter are divided into three difficulty levels (easy, medium, hard). Notably, problems require an average of 8.1 solution steps, with hard requiring 15.6, reflecting the complexity of physics-based reasoning. We propose the Physics Solution Auto Scoring Framework, incorporating efficient answer-level and comprehensive step-level evaluations. Top-performing models like Deepseek-R1, Gemini-2.0-Flash-Thinking, and o3-mini-high achieve less than 60% on answer-level evaluation, with performance dropping from knowledge questions (75.11%) to hard problems (31.95%). Through step-level evaluation, we identified four key bottlenecks: Physics Theorem Application, Physics Process Understanding, Calculation, and Physics Condition Analysis. These findings position PhysReason as a novel and comprehensive benchmark for evaluating physics-based reasoning capabilities in large language models. Our code and data will be published at https://dxzxy12138.github.io/PhysReason.
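
As a rough illustration of the figures quoted in the abstract, the sketch below works out the knowledge/reasoning split implied by the 25%/75% shares and shows one plausible way an answer-level accuracy could be tallied per category. This is only a minimal sketch under stated assumptions: the function names (`split_counts`, `accuracy_by_category`) and the toy result list are hypothetical, and nothing here is taken from the PhysReason code or its scoring framework.

```python
from collections import defaultdict

TOTAL_PROBLEMS = 1200    # stated in the abstract
KNOWLEDGE_SHARE = 0.25   # stated in the abstract; the remaining 75% is reasoning-based


def split_counts(total, knowledge_share):
    """Return the number of knowledge-based and reasoning-based problems."""
    knowledge = round(total * knowledge_share)
    return {"knowledge": knowledge, "reasoning": total - knowledge}


def accuracy_by_category(results):
    """Compute percent-correct per category from (category, is_correct) pairs.

    Hypothetical helper: a plain tally, not the paper's scoring framework.
    """
    totals = defaultdict(int)
    correct = defaultdict(int)
    for category, is_correct in results:
        totals[category] += 1
        correct[category] += int(is_correct)
    return {c: 100.0 * correct[c] / totals[c] for c in totals}


counts = split_counts(TOTAL_PROBLEMS, KNOWLEDGE_SHARE)
print(counts)  # {'knowledge': 300, 'reasoning': 900}

# Made-up toy results, for illustration only (not benchmark data).
toy_results = [
    ("knowledge", True),
    ("knowledge", True),
    ("hard", True),
    ("hard", False),
]
print(accuracy_by_category(toy_results))  # {'knowledge': 100.0, 'hard': 50.0}
```

The abstract does not give the sizes of the easy, medium, and hard subsets, so the sketch deliberately stops at the knowledge/reasoning split.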

arXiv information

Authors	Xinyu Zhang, Yuxuan Dong, Yanrui Wu, Jiaxing Huang, Chengyou Jia, Basura Fernando, Mike Zheng Shou, Lingling Zhang, Jun Liu
Published	2025-02-17 17:24:14+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
