PHYBench: Holistic Evaluation of Physical Perception and Reasoning in Large Language Models

Abstract

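We introduce PHYBench, a novel, high-quality benchmark designed to evaluate the reasoning capabilities of large language models (LLMs) in physical contexts.
PHYBench consists of 500 meticulously curated physics problems based on real-world physical scenarios, designed to assess a model's ability to understand and reason about realistic physical processes.
Covering mechanics, electromagnetism, thermodynamics, optics, modern physics, and advanced physics, the benchmark spans difficulty levels from high school exercises to undergraduate problems and Physics Olympiad challenges.
In addition, we propose the Expression Edit Distance (EED) Score, a novel evaluation metric based on the edit distance between mathematical expressions, which effectively captures differences in model reasoning processes and results beyond traditional binary scoring methods.
We evaluate various LLMs on PHYBench and compare their performance with that of human experts.
Our results reveal that even state-of-the-art reasoning models lag significantly behind human experts, highlighting their limitations and the need for improvement in complex physical reasoning scenarios.
Our benchmark results and dataset are publicly available at https://phybench-official.github.io/phybench-demo/.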

Abstract (original)

We introduce PHYBench, a novel, high-quality benchmark designed for evaluating reasoning capabilities of large language models (LLMs) in physical contexts. PHYBench consists of 500 meticulously curated physics problems based on real-world physical scenarios, designed to assess the ability of models to understand and reason about realistic physical processes. Covering mechanics, electromagnetism, thermodynamics, optics, modern physics, and advanced physics, the benchmark spans difficulty levels from high school exercises to undergraduate problems and Physics Olympiad challenges. Additionally, we propose the Expression Edit Distance (EED) Score, a novel evaluation metric based on the edit distance between mathematical expressions, which effectively captures differences in model reasoning processes and results beyond traditional binary scoring methods. We evaluate various LLMs on PHYBench and compare their performance with human experts. Our results reveal that even state-of-the-art reasoning models significantly lag behind human experts, highlighting their limitations and the need for improvement in complex physical reasoning scenarios. Our benchmark results and dataset are publicly available at https://phybench-official.github.io/phybench-demo/.
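The abstract describes the EED Score only as a metric based on the edit distance between mathematical expressions, so the following is a minimal sketch of that general idea rather than the authors' implementation. It assumes expressions are written in Python arithmetic syntax, flattens each parse tree into a pre-order token list, and applies a plain Levenshtein distance to the two lists. The function names `flatten`, `edit_distance`, and `relative_distance`, the choice of a sequence distance over flattened trees, and the normalization by the reference length are all assumptions made for illustration.

```python
import ast


def flatten(expr: str) -> list[str]:
    """Parse a Python-syntax arithmetic expression and list its nodes in pre-order."""
    tokens: list[str] = []

    def visit(node: ast.AST) -> None:
        if isinstance(node, ast.BinOp):
            tokens.append(type(node.op).__name__)  # e.g. "Add", "Mult", "Pow"
            visit(node.left)
            visit(node.right)
        elif isinstance(node, ast.UnaryOp):
            tokens.append(type(node.op).__name__)  # e.g. "USub"
            visit(node.operand)
        elif isinstance(node, ast.Call):
            tokens.append(ast.unparse(node.func))  # function name, e.g. "sqrt"
            for arg in node.args:
                visit(arg)
        elif isinstance(node, ast.Name):
            tokens.append(node.id)
        elif isinstance(node, ast.Constant):
            tokens.append(repr(node.value))
        else:
            raise ValueError(f"unsupported syntax: {ast.dump(node)}")

    visit(ast.parse(expr, mode="eval").body)
    return tokens


def edit_distance(a: list[str], b: list[str]) -> int:
    """Levenshtein distance between two token lists (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,              # delete x
                            curr[j - 1] + 1,          # insert y
                            prev[j - 1] + (x != y)))  # substitute or match
        prev = curr
    return prev[-1]


def relative_distance(reference: str, answer: str) -> float:
    """Edit distance normalized by the size of the reference expression (assumed)."""
    ref = flatten(reference)
    return edit_distance(ref, flatten(answer)) / len(ref)


if __name__ == "__main__":
    reference = "2*m*g/3"
    for answer in ("2*m*g/3", "m*g/3", "v**2"):
        print(answer, relative_distance(reference, answer))
```

Unlike a binary right-or-wrong check, a distance of this kind gives an answer that misses only a coefficient a smaller penalty than one with an entirely different structure. A fuller treatment would also simplify expressions symbolically before comparing them, so that algebraically equal answers written differently are not penalized; this sketch does not attempt that.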

arXiv information

Authors	Shi Qiu,Shaoyang Guo,Zhuo-Yang Song,Yunbo Sun,Zeyu Cai,Jiashen Wei,Tianyu Luo,Yixuan Yin,Haoxu Zhang,Yi Hu,Chenyang Wang,Chencheng Tang,Haoling Chang,Qi Liu,Ziheng Zhou,Tianyu Zhang,Jingtian Zhang,Zhangyi Liu,Minghao Li,Yuku Zhang,Boxuan Jing,Xianqi Yin,Yutong Ren,Zizhuo Fu,Weike Wang,Xudong Tian,Anqi Lv,Laifu Man,Jianxiang Li,Feiyu Tao,Qihua Sun,Zhou Liang,Yushu Mu,Zhongxuan Li,Jing-Jun Zhang,Shutao Zhang,Xiaotian Li,Xingqi Xia,Jiawei Lin,Zheyu Shen,Jiahang Chen,Qiuhao Xiong,Binran Wang,Fengyuan Wang,Ziyang Ni,Bohan Zhang,Fan Cui,Changkun Shao,Qing-Hong Cao,Ming-xing Luo,Muhan Zhang,Hua Xing Zhu
Published	2025-04-22 17:53:29+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
