Evaluating GPT-4 and ChatGPT on Japanese Medical Licensing Examinations

Summary

[Title] Evaluating GPT-4 and ChatGPT on Japanese Medical Licensing Examinations

[Summary]
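– As large language models (LLMs) gain popularity among speakers of diverse languages, it is important to benchmark them and to understand model behaviors, failures, and limitations in languages beyond English.
– This study evaluates LLM APIs (ChatGPT, GPT-3, and GPT-4) on the Japanese national medical licensing examinations from the past five years. The team comprises native Japanese-speaking NLP researchers and a practicing cardiologist based in Japan.
– In the experiments, GPT-4 outperforms ChatGPT and GPT-3 and passes all five years of the exams, highlighting LLMs' potential in a language that is typologically distant from English.
– However, the evaluation also exposes critical limitations of the current LLM APIs. First, LLMs sometimes select prohibited choices that must be strictly avoided in medical practice in Japan, such as suggesting euthanasia. Further, because of the way non-Latin scripts are currently tokenized in the pipeline, API costs are generally higher and the maximum context size is smaller for Japanese.
– The authors release the benchmark, Igaku QA, along with all model outputs and exam metadata, in the hope of spurring progress on more diverse applications of LLMs. The benchmark is available at https://github.com/jungokasai/IgakuQA.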
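
The tokenization point can be made a little more concrete with a rough sketch. It assumes a byte-level BPE tokenizer, where the number of UTF-8 bytes is an upper bound on the number of tokens; the summary does not say which tokenizer the APIs use, so this is an illustration of the mechanism and not the authors' cost analysis. The helper name `utf8_bytes_per_char` and the two sample strings are hypothetical.

```python
def utf8_bytes_per_char(text: str) -> float:
    """Average number of UTF-8 bytes per character in `text`.

    Assumption: with a byte-level BPE tokenizer, bytes are the base units,
    so more bytes per character leaves room for more tokens per character.
    This is only a proxy; it does not compute real token counts.
    """
    if not text:
        return 0.0
    return len(text.encode("utf-8")) / len(text)


# Hypothetical sample strings chosen for illustration only.
english = "medical licensing examination"
japanese = "医師国家試験"

print(utf8_bytes_per_char(english))   # ASCII letters and spaces use 1 byte each
print(utf8_bytes_per_char(japanese))  # these kanji use 3 bytes each
```

Actual token counts depend on the tokenizer's learned merges, so the real gap between English and Japanese can be smaller or larger than this byte ratio suggests.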

Summary (original)

As large language models (LLMs) gain popularity among speakers of diverse languages, we believe that it is crucial to benchmark them to better understand model behaviors, failures, and limitations in languages beyond English. In this work, we evaluate LLM APIs (ChatGPT, GPT-3, and GPT-4) on the Japanese national medical licensing examinations from the past five years. Our team comprises native Japanese-speaking NLP researchers and a practicing cardiologist based in Japan. Our experiments show that GPT-4 outperforms ChatGPT and GPT-3 and passes all five years of the exams, highlighting LLMs’ potential in a language that is typologically distant from English. However, our evaluation also exposes critical limitations of the current LLM APIs. First, LLMs sometimes select prohibited choices that should be strictly avoided in medical practice in Japan, such as suggesting euthanasia. Further, our analysis shows that the API costs are generally higher and the maximum context size is smaller for Japanese because of the way non-Latin scripts are currently tokenized in the pipeline. We release our benchmark as Igaku QA as well as all model outputs and exam metadata. We hope that our results and benchmark will spur progress on more diverse applications of LLMs. Our benchmark is available at https://github.com/jungokasai/IgakuQA.

arXiv information

Authors	Jungo Kasai, Yuhei Kasai, Keisuke Sakaguchi, Yutaro Yamada, Dragomir Radev
Published	2023-03-31 13:04:47+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, OpenAI
