LLaVA-RadZ: Can Multimodal Large Language Models Effectively Tackle Zero-shot Radiology Recognition?

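Summary

Recently, multimodal large language models (MLLMs) have demonstrated exceptional visual understanding and reasoning capabilities across a wide range of vision-language tasks.
However, MLLMs usually perform poorly in zero-shot medical disease recognition, because they do not fully exploit the captured features or the medical knowledge available to them.
To address this challenge, we propose LLaVA-RadZ, a simple yet effective framework for zero-shot medical disease recognition.
Specifically, we design an end-to-end training strategy, termed Decoding-Side Feature Alignment Training (DFAT), that takes advantage of the characteristics of the MLLM decoder architecture and incorporates modality-specific tokens tailored to each modality, making effective use of image and text representations and promoting robust cross-modal alignment.
In addition, we introduce a Domain Knowledge Anchoring Module (DKAM) that exploits the intrinsic medical knowledge of large models to mitigate the category semantic gap in image-text alignment.
DKAM improves category-level alignment, enabling accurate disease recognition.
Extensive experiments on multiple benchmarks show that LLaVA-RadZ significantly outperforms traditional MLLMs in zero-shot disease recognition and achieves state-of-the-art performance compared with well-established, highly optimized CLIP-based approaches.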

Abstract (original)

Recently, multimodal large models (MLLMs) have demonstrated exceptional capabilities in visual understanding and reasoning across various vision-language tasks. However, MLLMs usually perform poorly in zero-shot medical disease recognition, as they do not fully exploit the captured features and available medical knowledge. To address this challenge, we propose LLaVA-RadZ, a simple yet effective framework for zero-shot medical disease recognition. Specifically, we design an end-to-end training strategy, termed Decoding-Side Feature Alignment Training (DFAT) to take advantage of the characteristics of the MLLM decoder architecture and incorporate modality-specific tokens tailored for different modalities, which effectively utilizes image and text representations and facilitates robust cross-modal alignment. Additionally, we introduce a Domain Knowledge Anchoring Module (DKAM) to exploit the intrinsic medical knowledge of large models, which mitigates the category semantic gap in image-text alignment. DKAM improves category-level alignment, allowing for accurate disease recognition. Extensive experiments on multiple benchmarks demonstrate that our LLaVA-RadZ significantly outperforms traditional MLLMs in zero-shot disease recognition and exhibits the state-of-the-art performance compared to the well-established and highly-optimized CLIP-based approaches.
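As a rough illustration of what zero-shot recognition through image-text alignment means in practice, the sketch below scores an image embedding against one text embedding per disease label and picks the closest match. This is the generic CLIP-style procedure, not the authors' DFAT or DKAM implementation; the function names, the temperature value, and the three-dimensional toy embeddings are all hypothetical and stand in for whatever an aligned image and text encoder would produce.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def zero_shot_predict(image_emb, class_embs, temperature=0.07):
    """Pick the label whose text embedding is most similar to the image.

    class_embs maps each label to a text embedding. The temperature is an
    assumed scaling factor, not a value taken from the paper.
    """
    labels = list(class_embs)
    logits = [cosine(image_emb, class_embs[k]) / temperature for k in labels]
    # Numerically stable softmax over the similarity logits.
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    probs = {k: e / total for k, e in zip(labels, exps)}
    best = max(probs, key=probs.get)
    return best, probs


# Toy embeddings (hypothetical); real ones would come from trained encoders.
class_embs = {
    "pneumonia": [0.9, 0.1, 0.0],
    "no finding": [0.0, 0.2, 0.9],
}
image_emb = [0.8, 0.3, 0.1]

label, probs = zero_shot_predict(image_emb, class_embs)
print(label, probs)
```

No labeled training images for the target classes are needed at prediction time: adding a new disease only requires a text embedding for its name or description, which is why the quality of the image-text alignment matters so much.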

arXiv information

Authors	Bangyan Li, Wenxuan Huang, Yunhang Shen, Yeqiang Wang, Shaohui Lin, Jingzhong Lin, Ling You, Yinqi Zhang, Ke Li, Xing Sun, Yuling Sun
Published	2025-03-10 16:05:40+00:00
arXiv page	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
