A recent evaluation on the performance of LLMs on radiation oncology physics using questions of randomly shuffled options

Summary

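Purpose: We present an updated study evaluating how well large language models (LLMs) answer radiation oncology physics questions, focusing on recently released models.
Methods: The study used a set of 100 multiple-choice radiation oncology physics questions previously written by an experienced physicist.
The answer options of each question were randomly shuffled to create "new" exam sets.
Five LLMs (OpenAI o1-preview, GPT-4o, LLaMA 3.1 (405B), Gemini 1.5 Pro, and Claude 3.5 Sonnet), in the versions released before September 30, 2024, were queried with these new exam sets.
To evaluate deductive reasoning ability, the correct answer option in each question was replaced with "None of the above."
Explain-first and step-by-step instruction prompts were then used to test whether this strategy improved reasoning ability.
The performance of the LLMs was compared with answers from medical physicists.
Results: All models showed expert-level performance on these questions, and o1-preview even surpassed the medical physicists' majority vote.
When the correct answer options were replaced with "None of the above," performance dropped considerably for all models, indicating room for improvement.
The explain-first and step-by-step instruction prompts helped enhance the reasoning ability of the LLaMA 3.1 (405B), Gemini 1.5 Pro, and Claude 3.5 Sonnet models.
Conclusion: These recently released LLMs showed expert-level performance in answering radiation oncology physics questions and considerable potential to assist radiation oncology physics education and training.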

Abstract (original)

Purpose: We present an updated study evaluating the performance of large language models (LLMs) in answering radiation oncology physics questions, focusing on the recently released models. Methods: A set of 100 multiple-choice radiation oncology physics questions, previously created by a well-experienced physicist, was used for this study. The answer options of the questions were randomly shuffled to create ‘new’ exam sets. Five LLMs — OpenAI o1-preview, GPT-4o, LLaMA 3.1 (405B), Gemini 1.5 Pro, and Claude 3.5 Sonnet — with the versions released before September 30, 2024, were queried using these new exam sets. To evaluate their deductive reasoning ability, the correct answer options in the questions were replaced with ‘None of the above.’ Then, the explain-first and step-by-step instruction prompts were used to test if this strategy improved their reasoning ability. The performance of the LLMs was compared with the answers from medical physicists. Results: All models demonstrated expert-level performance on these questions, with o1-preview even surpassing medical physicists with a majority vote. When replacing the correct answer options with ‘None of the above’, all models exhibited a considerable decline in performance, suggesting room for improvement. The explain-first and step-by-step instruction prompts helped enhance the reasoning ability of the LLaMA 3.1 (405B), Gemini 1.5 Pro, and Claude 3.5 Sonnet models. Conclusion: These recently released LLMs demonstrated expert-level performance in answering radiation oncology physics questions, exhibiting great potential to assist in radiation oncology physics education and training.
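
The abstract describes three mechanical steps: shuffling answer options, replacing the correct option with "None of the above," and comparing against a majority vote. The following Python sketch illustrates one way these steps could be implemented. It is not the authors' code. The function names, the in-place position of the "None of the above" replacement, and the tie-breaking rule for the majority vote are all assumptions made for illustration, since the abstract does not specify them.

```python
import random
from collections import Counter

NOTA = "None of the above"


def shuffle_options(options, correct_index, rng):
    """Shuffle the answer options and track where the correct one lands.

    Returns (shuffled_options, new_correct_index).
    """
    order = list(range(len(options)))
    rng.shuffle(order)  # order[k] is the original index now shown at position k
    shuffled = [options[i] for i in order]
    return shuffled, order.index(correct_index)


def replace_correct_with_nota(options, correct_index):
    """Return a copy in which the correct option's text is replaced by NOTA.

    Assumption: the replacement keeps the correct option's position, so the
    answer key index is unchanged.
    """
    replaced = list(options)
    replaced[correct_index] = NOTA
    return replaced


def majority_vote(answers):
    """Return the most common answer among respondents.

    Assumption: ties go to the answer that appears first in the list.
    """
    return Counter(answers).most_common(1)[0][0]


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the shuffle is reproducible
    # Placeholder option texts; these are not questions from the study.
    options = ["option 1", "option 2", "option 3", "option 4"]
    shuffled, key = shuffle_options(options, correct_index=2, rng=rng)
    print(shuffled, "correct position:", key)
    print(replace_correct_with_nota(shuffled, key))
    print(majority_vote(["A", "B", "B", "C", "B"]))
```

Returning the new index from `shuffle_options` keeps the answer key in sync with the shuffled order, so the same index can be passed directly to `replace_correct_with_nota`.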

arXiv information

Authors	Peilong Wang, Jason Holmes, Zhengliang Liu, Dequan Chen, Tianming Liu, Jiajian Shen, Wei Liu
Published	2025-01-21 17:20:31+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
