Large-scale and Fine-grained Vision-language Pre-training for Enhanced CT Image Understanding

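Abstract

Artificial intelligence (AI) has shown great potential for helping radiologists read and diagnose medical images more efficiently and accurately.
A versatile AI model, however, requires large-scale data and comprehensive annotations, which are often impractical to obtain in clinical settings.
Recent studies treat radiology reports as natural, high-quality supervision for medical images and use contrastive language-image pre-training (CLIP) to develop language-informed models for radiological image interpretation.
These approaches nevertheless typically contrast whole images with whole reports and ignore the local associations between imaging regions and report sentences, which may undermine model performance and interoperability.
In this paper, we propose a fine-grained vision-language model (fVLM) for anatomy-level CT image interpretation.
Specifically, we explicitly match anatomical regions of CT images with the corresponding descriptions in radiology reports and perform contrastive pre-training for each anatomy individually.
Fine-grained alignment, however, faces considerable false-negative challenges, mainly because anatomy-level healthy samples and similarly diseased abnormalities are abundant.
To address this, we propose identifying false negatives among both normal and abnormal samples and calibrating contrastive learning from patient-level pairing to disease-aware pairing.
We curated the largest CT dataset to date, comprising imaging and report data from 69,086 patients, and comprehensively evaluated 54 major and important disease diagnosis tasks across 15 main anatomies.
The experimental results demonstrate the substantial potential of fVLM for versatile medical image interpretation.
In zero-shot classification, fVLM achieved an average AUC of 81.3% across the 54 diagnosis tasks, surpassing CLIP and supervised methods by 12.9% and 8.0%, respectively.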
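
The disease-aware pairing described above can be illustrated with a small sketch. It is not the authors' implementation: it assumes a precomputed similarity matrix between image regions and report descriptions for a single anatomy, and it uses hypothetical labels ("normal", "cyst") to decide which pairs count as positives. Under patient-level pairing only the diagonal is positive. Under disease-aware pairing every same-label pair is added to the numerator of an InfoNCE-style loss, so healthy samples from different patients no longer act as false negatives. The function names, the temperature of 0.07, and the toy numbers are all assumptions.

```python
import math


def info_nce(sim, positives, temperature=0.07):
    """Mean image-to-text InfoNCE-style loss with one or more positives per row.

    sim[i][j]    : similarity between image region i and report text j
    positives[i] : set of column indices treated as positives for row i
    """
    total = 0.0
    for i, row in enumerate(sim):
        logits = [s / temperature for s in row]
        m = max(logits)  # subtract the max for numerical stability
        exps = [math.exp(v - m) for v in logits]
        pos_mass = sum(exps[j] for j in positives[i])
        total += -math.log(pos_mass / sum(exps))
    return total / len(sim)


def patient_level_positives(n):
    """Standard CLIP-style pairing: only the matching report is positive."""
    return [{i} for i in range(n)]


def disease_aware_positives(labels):
    """Assumed pairing rule: samples sharing a label are mutual positives."""
    return [
        {j for j, other in enumerate(labels) if other == label}
        for label in labels
    ]


# Toy batch for one anatomy; the labels and similarities are hypothetical.
labels = ["normal", "normal", "cyst"]
sim = [
    [0.9, 0.8, 0.1],
    [0.8, 0.9, 0.1],
    [0.1, 0.1, 0.9],
]

patient_loss = info_nce(sim, patient_level_positives(len(labels)))
aware_loss = info_nce(sim, disease_aware_positives(labels))
print(f"patient-level loss: {patient_loss:.4f}")
print(f"disease-aware loss: {aware_loss:.4f}")
```

On this toy batch the disease-aware loss is smaller than the patient-level loss, because the two healthy samples are no longer pushed apart. A real pipeline would also need the reverse (text-to-image) direction and a way to obtain the labels, neither of which is shown here.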

Abstract (original)

Artificial intelligence (AI) shows great potential in assisting radiologists to improve the efficiency and accuracy of medical image interpretation and diagnosis. However, a versatile AI model requires large-scale data and comprehensive annotations, which are often impractical in medical settings. Recent studies leverage radiology reports as a naturally high-quality supervision for medical images, using contrastive language-image pre-training (CLIP) to develop language-informed models for radiological image interpretation. Nonetheless, these approaches typically contrast entire images with reports, neglecting the local associations between imaging regions and report sentences, which may undermine model performance and interoperability. In this paper, we propose a fine-grained vision-language model (fVLM) for anatomy-level CT image interpretation. Specifically, we explicitly match anatomical regions of CT images with corresponding descriptions in radiology reports and perform contrastive pre-training for each anatomy individually. Fine-grained alignment, however, faces considerable false-negative challenges, mainly from the abundance of anatomy-level healthy samples and similarly diseased abnormalities. To tackle this issue, we propose identifying false negatives of both normal and abnormal samples and calibrating contrastive learning from patient-level to disease-aware pairing. We curated the largest CT dataset to date, comprising imaging and report data from 69,086 patients, and conducted a comprehensive evaluation of 54 major and important disease diagnosis tasks across 15 main anatomies. Experimental results demonstrate the substantial potential of fVLM in versatile medical image interpretation. In the zero-shot classification task, we achieved an average AUC of 81.3% on 54 diagnosis tasks, surpassing CLIP and supervised methods by 12.9% and 8.0%, respectively.

arXiv information

Authors	Zhongyi Shui, Jianpeng Zhang, Weiwei Cao, Sinuo Wang, Ruizhe Guo, Le Lu, Lin Yang, Xianghua Ye, Tingbo Liang, Qi Zhang, Ling Zhang
Published	2025-01-24 14:50:48+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
