D-NLP at SemEval-2024 Task 2: Evaluating Clinical Inference Capabilities of Large Language Models

Abstract

Large language models (LLMs) have garnered significant attention and widespread usage due to their impressive performance in various tasks. However, they are not without their own set of challenges, including issues such as hallucinations, factual inconsistencies, and limitations in numerical-quantitative reasoning. Evaluating LLMs in miscellaneous reasoning tasks remains an active area of research. Prior to the breakthrough of LLMs, Transformers had already proven successful in the medical domain, effectively employed for various natural language understanding (NLU) tasks. Following this trend, LLMs have also been trained and utilized in the medical domain, raising concerns regarding factual accuracy, adherence to safety protocols, and inherent limitations. In this paper, we focus on evaluating the natural language inference capabilities of popular open-source and closed-source LLMs using clinical trial reports as the dataset. We present the performance results of each LLM and further analyze their performance on a development set, particularly focusing on challenging instances that involve medical abbreviations and require numerical-quantitative reasoning. Gemini, our leading LLM, achieved a test set F1-score of 0.748, securing the ninth position on the task scoreboard. Our work is the first of its kind, offering a thorough examination of the inference capabilities of LLMs within the medical domain.

arXiv information

Author	Duygu Altinok
Published	2024-05-07 10:11:14+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
