MedArabiQ: Benchmarking Large Language Models on Arabic Medical Tasks

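Summary

Large language models (LLMs) have shown significant promise for a variety of healthcare applications.
However, their effectiveness in the Arabic medical domain remains unexplored because high-quality, domain-specific datasets and benchmarks are lacking.
This study introduces MedArabiQ, a new benchmark dataset of seven Arabic medical tasks that covers multiple specialties and includes multiple-choice questions, fill-in-the-blank questions, and patient-doctor question answering.
We first built the dataset from past medical exams and publicly available datasets.
We then introduced a range of modifications to evaluate various LLM capabilities, including bias mitigation.
We conducted an extensive evaluation of five state-of-the-art open-source and proprietary LLMs, including GPT-4o, Claude 3.5-Sonnet, and Gemini 1.5.
Our findings highlight the need to create new high-quality benchmarks spanning different languages to ensure the fair deployment and scalability of LLMs in healthcare.
By establishing this benchmark and releasing the dataset, we provide a foundation for future research aimed at evaluating and enhancing the multilingual capabilities of LLMs for the equitable use of generative AI in healthcare.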

Abstract (original)

Large Language Models (LLMs) have demonstrated significant promise for various applications in healthcare. However, their efficacy in the Arabic medical domain remains unexplored due to the lack of high-quality domain-specific datasets and benchmarks. This study introduces MedArabiQ, a novel benchmark dataset consisting of seven Arabic medical tasks, covering multiple specialties and including multiple choice questions, fill-in-the-blank, and patient-doctor question answering. We first constructed the dataset using past medical exams and publicly available datasets. We then introduced different modifications to evaluate various LLM capabilities, including bias mitigation. We conducted an extensive evaluation with five state-of-the-art open-source and proprietary LLMs, including GPT-4o, Claude 3.5-Sonnet, and Gemini 1.5. Our findings highlight the need for the creation of new high-quality benchmarks that span different languages to ensure fair deployment and scalability of LLMs in healthcare. By establishing this benchmark and releasing the dataset, we provide a foundation for future research aimed at evaluating and enhancing the multilingual capabilities of LLMs for the equitable use of generative AI in healthcare.

arXiv information

Authors	Mouath Abu Daoud, Chaimae Abouzahir, Leen Kharouf, Walid Al-Eisawi, Nizar Habash, Farah E. Shamout
Published	2025-05-06 11:07:26+00:00
arXiv page	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google
