Evaluating GPT-4 and ChatGPT on Japanese Medical Licensing Examinations

Summary

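Title: Evaluating GPT-4 and ChatGPT on Japanese Medical Licensing Examinations

Summary:
- As large language models (LLMs) spread among speakers of many languages, the authors argue that benchmarks are needed to understand model behaviors, failures, and limitations in languages other than English, including Japanese.
- A team of native Japanese-speaking NLP researchers and a practicing cardiologist based in Japan evaluated LLM APIs (ChatGPT, GPT-3, and GPT-4) on the Japanese national medical licensing examinations from the past five years, including the current year.
- GPT-4 outperformed ChatGPT and GPT-3 and passed all six years of the exams, which shows the potential of LLMs in a language that is typologically distant from English.
- However, the evaluation also exposed critical limitations of the current LLM APIs. The models sometimes selected prohibited choices that must be strictly avoided in medical practice in Japan, such as suggesting euthanasia. In addition, because of the way non-Latin scripts are currently tokenized, API costs are generally higher and the maximum context size is smaller for Japanese.
- The team releases the benchmark as Igaku QA, together with all model outputs and exam metadata, and hopes it will spur progress on more diverse applications of LLMs.
- The benchmark is available at https://github.com/jungokasai/IgakuQA.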
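
The tokenization point can be illustrated with a rough sketch. The summary does not say how the tokenizer works; the example below only assumes that a byte-level tokenizer starts from UTF-8 bytes, so text that needs more bytes per character tends to need more tokens unless learned merges compress it. The function name and the sample strings are hypothetical, and the numbers are byte counts, not actual token counts from any OpenAI API.

```python
def utf8_bytes_per_char(text: str) -> float:
    """Return the average number of UTF-8 bytes per character in `text`."""
    if not text:
        return 0.0
    return len(text.encode("utf-8")) / len(text)


# Hypothetical sample strings, not taken from the benchmark.
english = "chest pain"
japanese = "胸痛"

# ASCII letters take 1 byte each; these kanji take 3 bytes each in UTF-8.
print(utf8_bytes_per_char(english))
print(utf8_bytes_per_char(japanese))
```

A higher byte count per character is only a proxy. To measure the real effect on cost and context size, one would count tokens with the tokenizer that the API in question actually uses.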

Abstract (original)

As large language models (LLMs) gain popularity among speakers of diverse languages, we believe that it is crucial to benchmark them to better understand model behaviors, failures, and limitations in languages beyond English. In this work, we evaluate LLM APIs (ChatGPT, GPT-3, and GPT-4) on the Japanese national medical licensing examinations from the past five years, including the current year. Our team comprises native Japanese-speaking NLP researchers and a practicing cardiologist based in Japan. Our experiments show that GPT-4 outperforms ChatGPT and GPT-3 and passes all six years of the exams, highlighting LLMs’ potential in a language that is typologically distant from English. However, our evaluation also exposes critical limitations of the current LLM APIs. First, LLMs sometimes select prohibited choices that should be strictly avoided in medical practice in Japan, such as suggesting euthanasia. Further, our analysis shows that the API costs are generally higher and the maximum context size is smaller for Japanese because of the way non-Latin scripts are currently tokenized in the pipeline. We release our benchmark as Igaku QA as well as all model outputs and exam metadata. We hope that our results and benchmark will spur progress on more diverse applications of LLMs. Our benchmark is available at https://github.com/jungokasai/IgakuQA.

arXiv information

Authors	Jungo Kasai, Yuhei Kasai, Keisuke Sakaguchi, Yutaro Yamada, Dragomir Radev
Published	2023-04-05 07:53:37+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, OpenAI
