TCM-3CEval: A Triaxial Benchmark for Assessing Responses from Large Language Models in Traditional Chinese Medicine

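Summary

Large language models (LLMs) excel at a range of NLP tasks and in modern medicine, but their evaluation in traditional Chinese medicine (TCM) remains underexplored.
To address this, we introduce TCM-3CEval, a benchmark that assesses LLMs in TCM along three dimensions: core knowledge mastery, classical text understanding, and clinical decision-making.
We evaluate a diverse set of models, including international (e.g., GPT-4o), Chinese (e.g., InternLM), and medical-specific (e.g., PLUSE) models.
The results show a performance hierarchy: all models have limitations in specialized subdomains such as Meridian & Acupoint theory and Various TCM Schools, revealing gaps between current capabilities and clinical needs.
Models with Chinese linguistic and cultural priors perform better in classical text interpretation and clinical reasoning.
TCM-3CEval sets a standard for AI evaluation in TCM and offers insights for optimizing LLMs in culturally grounded medical domains.
The benchmark is available on Medbench's TCM track and aims to assess LLMs' TCM capabilities in basic knowledge, classic texts, and clinical decision-making through multidimensional questions and real cases.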

Summary (original)

Large language models (LLMs) excel in various NLP tasks and modern medicine, but their evaluation in traditional Chinese medicine (TCM) is underexplored. To address this, we introduce TCM-3CEval, a benchmark assessing LLMs in TCM across three dimensions: core knowledge mastery, classical text understanding, and clinical decision-making. We evaluate diverse models, including international (e.g., GPT-4o), Chinese (e.g., InternLM), and medical-specific (e.g., PLUSE). Results show a performance hierarchy: all models have limitations in specialized subdomains like Meridian & Acupoint theory and Various TCM Schools, revealing gaps between current capabilities and clinical needs. Models with Chinese linguistic and cultural priors perform better in classical text interpretation and clinical reasoning. TCM-3CEval sets a standard for AI evaluation in TCM, offering insights for optimizing LLMs in culturally grounded medical domains. The benchmark is available on Medbench’s TCM track, aiming to assess LLMs’ TCM capabilities in basic knowledge, classic texts, and clinical decision-making through multidimensional questions and real cases.

arXiv information

Authors	Tianai Huang, Lu Lu, Jiayuan Chen, Lihao Liu, Junjun He, Yuping Zhao, Wenchao Tang, Jie Xu
Published	2025-03-10 08:29:15+00:00
arXiv site	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google
