Evaluating GPT- and Reasoning-based Large Language Models on Physics Olympiad Problems: Surpassing Human Performance and Implications for Educational Assessment

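Abstract

Large language models (LLMs) are widely accessible and reach learners at every educational level.
This development has raised concerns that their use may bypass essential learning processes and undermine the integrity of established assessment formats.
In physics education, where problem solving plays a central role in instruction and assessment, it is therefore essential to understand the physics-specific problem-solving capabilities of LLMs.
Such an understanding is key to informing responsible, pedagogically sound approaches to integrating LLMs into instruction and assessment.
This study therefore compares the problem-solving performance of a general-purpose LLM (GPT-4o, with various prompting techniques) and a reasoning-optimized model (o1-preview) with that of participants in the German Physics Olympiad, based on a set of well-defined Olympiad problems.
In addition to evaluating the correctness of the generated solutions, the study analyzes the characteristic strengths and limitations of LLM-generated solutions.
The findings indicate that both tested LLMs (GPT-4o and o1-preview) demonstrate advanced problem-solving capabilities on Olympiad-type physics problems and, on average, outperform the human participants.
Prompting techniques had little effect on GPT-4o's performance, whereas o1-preview almost consistently outperformed both GPT-4o and the human benchmark.
Based on these findings, the study discusses implications for the design of summative and formative assessment in physics education.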

Abstract (original)

Large language models (LLMs) are now widely accessible, reaching learners at all educational levels. This development has raised concerns that their use may circumvent essential learning processes and compromise the integrity of established assessment formats. In physics education, where problem solving plays a central role in instruction and assessment, it is therefore essential to understand the physics-specific problem-solving capabilities of LLMs. Such understanding is key to informing responsible and pedagogically sound approaches to integrating LLMs into instruction and assessment. This study therefore compares the problem-solving performance of a general-purpose LLM (GPT-4o, using varying prompting techniques) and a reasoning-optimized model (o1-preview) with that of participants of the German Physics Olympiad, based on a set of well-defined Olympiad problems. In addition to evaluating the correctness of the generated solutions, the study analyzes characteristic strengths and limitations of LLM-generated solutions. The findings of this study indicate that both tested LLMs (GPT-4o and o1-preview) demonstrate advanced problem-solving capabilities on Olympiad-type physics problems, on average outperforming the human participants. Prompting techniques had little effect on GPT-4o’s performance, while o1-preview almost consistently outperformed both GPT-4o and the human benchmark. Based on these findings, the study discusses implications for the design of summative and formative assessment in physics education, including how to uphold assessment integrity and support students in critically engaging with LLMs.

arXiv information

Authors	Paul Tschisgale, Holger Maus, Fabian Kieser, Ben Kroehs, Stefan Petersen, Peter Wulff
Published	2025-05-14 14:46:32+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
