VILA-M3: Enhancing Vision-Language Models with Medical Expert Knowledge

Summary

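Generalist vision language models (VLMs) have made significant strides in computer vision, but they fall short in specialized fields such as healthcare, where expert knowledge is essential.
In traditional computer vision tasks, creative or approximate answers may be acceptable, but in healthcare, precision is paramount. Large multimodal models such as Gemini and GPT-4o are insufficient for medical tasks because they rely on memorized internet knowledge rather than the nuanced expertise that healthcare requires.
VLMs are usually trained in three stages: vision pre-training, vision-language pre-training, and instruction fine-tuning (IFT).
IFT has typically been applied using a mixture of generic and healthcare data.
In contrast, we propose that medical VLMs need a fourth stage of specialized IFT, which focuses on medical data and includes information from domain expert models.
Domain expert models developed for medical use are crucial because they are specifically trained for certain clinical tasks, e.g., detecting tumors and classifying abnormalities through segmentation and classification.
These models learn fine-grained features of medical data, features that are often too intricate for a VLM to capture effectively, especially in radiology.
This paper introduces VILA-M3, a new framework for medical VLMs that utilizes domain knowledge via expert models.
Through our experiments, we show improved state-of-the-art (SOTA) performance, with an average improvement of ~9% over the prior SOTA model Med-Gemini and ~6% over models trained on the specific tasks.
Our approach emphasizes the importance of domain expertise in creating precise, reliable VLMs for medical applications.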

Abstract (original)

Generalist vision language models (VLMs) have made significant strides in computer vision, but they fall short in specialized fields like healthcare, where expert knowledge is essential. In traditional computer vision tasks, creative or approximate answers may be acceptable, but in healthcare, precision is paramount. Current large multimodal models like Gemini and GPT-4o are insufficient for medical tasks due to their reliance on memorized internet knowledge rather than the nuanced expertise required in healthcare. VLMs are usually trained in three stages: vision pre-training, vision-language pre-training, and instruction fine-tuning (IFT). IFT has been typically applied using a mixture of generic and healthcare data. In contrast, we propose that for medical VLMs, a fourth stage of specialized IFT is necessary, which focuses on medical data and includes information from domain expert models. Domain expert models developed for medical use are crucial because they are specifically trained for certain clinical tasks, e.g., to detect tumors and classify abnormalities through segmentation and classification, which learn fine-grained features of medical data, features that are often too intricate for a VLM to capture effectively, especially in radiology. This paper introduces a new framework, VILA-M3, for medical VLMs that utilizes domain knowledge via expert models. Through our experiments, we show an improved state-of-the-art (SOTA) performance with an average improvement of ~9% over the prior SOTA model Med-Gemini and ~6% over models trained on the specific tasks. Our approach emphasizes the importance of domain expertise in creating precise, reliable VLMs for medical applications.

arXiv information

Authors	Vishwesh Nath, Wenqi Li, Dong Yang, Andriy Myronenko, Mingxin Zheng, Yao Lu, Zhijian Liu, Hongxu Yin, Yucheng Tang, Pengfei Guo, Can Zhao, Ziyue Xu, Yufan He, Greg Heinrich, Yee Man Law, Benjamin Simon, Stephanie Harmon, Stephen Aylward, Marc Edgar, Michael Zephyr, Song Han, Pavlo Molchanov, Baris Turkbey, Holger Roth, Daguang Xu
Published	2025-03-04 18:51:48+00:00
arXiv page	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
