RadVLM: A Multitask Conversational Vision-Language Model for Radiology

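Summary

The widespread use of chest X-rays (CXRs), combined with a shortage of radiologists, is driving growing interest in automated CXR analysis and AI-assisted reporting.
Existing vision-language models (VLMs) show promise on specific tasks such as report generation or abnormality detection, but they often lack support for interactive diagnostic use.
This work introduces RadVLM, a compact, multitask conversational foundation model designed for CXR interpretation.
To this end, the authors curate a large-scale instruction dataset of more than one million image-instruction pairs, covering both single-turn tasks (such as report generation, abnormality classification, and visual grounding) and multi-turn, multi-task conversational interactions.
After fine-tuning RadVLM on this instruction dataset, they evaluate it on a range of tasks alongside re-implemented baseline VLMs.
The results show that RadVLM achieves state-of-the-art performance in conversational capabilities and visual grounding while remaining competitive on other radiology tasks.
Ablation studies further highlight the benefit of joint training across multiple tasks, particularly in scenarios with limited annotated data.
Taken together, these findings point to the potential of RadVLM as a clinically relevant AI assistant that provides structured CXR interpretation and conversational capabilities in support of more effective and accessible diagnostic workflows.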
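
The summary describes the instruction dataset only at a high level: image-instruction pairs that are either single-turn tasks or multi-turn conversations. As a rough illustration of how such records could be represented, here is a minimal sketch. The class names, field names, task labels, and file names are all hypothetical and are not taken from the RadVLM paper or its released data.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Turn:
    role: str  # "user" or "assistant" (assumed convention)
    text: str


@dataclass
class InstructionSample:
    image_path: str  # hypothetical path to a chest X-ray image
    task: str        # hypothetical task label, e.g. "visual_grounding"
    turns: List[Turn] = field(default_factory=list)

    def is_multi_turn(self) -> bool:
        # A sample counts as multi-turn if the user speaks more than once.
        return sum(t.role == "user" for t in self.turns) > 1


# Single-turn example: one instruction, one answer.
single = InstructionSample(
    image_path="cxr_0001.png",
    task="abnormality_classification",
    turns=[
        Turn("user", "Which abnormalities are visible in this image?"),
        Turn("assistant", "<answer placeholder>"),
    ],
)

# Multi-turn example: several instructions about the same image.
multi = InstructionSample(
    image_path="cxr_0002.png",
    task="conversation",
    turns=[
        Turn("user", "Describe this chest X-ray."),
        Turn("assistant", "<answer placeholder>"),
        Turn("user", "Where is the finding located?"),
        Turn("assistant", "<answer placeholder>"),
    ],
)
```

Keeping single-turn and multi-turn samples in one record type is just one possible design; it would let a joint training loop treat every task uniformly as a list of turns tied to an image.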

Summary (original)

The widespread use of chest X-rays (CXRs), coupled with a shortage of radiologists, has driven growing interest in automated CXR analysis and AI-assisted reporting. While existing vision-language models (VLMs) show promise in specific tasks such as report generation or abnormality detection, they often lack support for interactive diagnostic capabilities. In this work we present RadVLM, a compact, multitask conversational foundation model designed for CXR interpretation. To this end, we curate a large-scale instruction dataset comprising over 1 million image-instruction pairs containing both single-turn tasks — such as report generation, abnormality classification, and visual grounding — and multi-turn, multi-task conversational interactions. After fine-tuning RadVLM on this instruction dataset, we evaluate it across different tasks along with re-implemented baseline VLMs. Our results show that RadVLM achieves state-of-the-art performance in conversational capabilities and visual grounding while remaining competitive in other radiology tasks. Ablation studies further highlight the benefit of joint training across multiple tasks, particularly for scenarios with limited annotated data. Together, these findings highlight the potential of RadVLM as a clinically relevant AI assistant, providing structured CXR interpretation and conversational capabilities to support more effective and accessible diagnostic workflows.

arXiv information

Authors	Nicolas Deperrois, Hidetoshi Matsuo, Samuel Ruipérez-Campillo, Moritz Vandenhirtz, Sonia Laguna, Alain Ryser, Koji Fujimoto, Mizuho Nishio, Thomas M. Sutter, Julia E. Vogt, Jonas Kluckert, Thomas Frauenfelder, Christian Blüthgen, Farhad Nooralahzadeh, Michael Krauthammer
Published	2025-02-05 16:27:02+00:00
arXiv page	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
