Developing Generalist Foundation Models from a Multimodal Dataset for 3D Computed Tomography

Abstract

While computer vision has achieved tremendous success with multimodal encoding and direct textual interaction with images via chat-based large language models, similar advancements in medical imaging AI, particularly in 3D imaging, have been limited due to the scarcity of comprehensive datasets. To address this critical gap, we introduce CT-RATE, the first dataset that pairs 3D medical images with corresponding textual reports. CT-RATE comprises 25,692 non-contrast 3D chest CT scans from 21,304 unique patients. Through various reconstructions, these scans are expanded to 50,188 volumes, totaling over 14.3 million 2D slices. Each scan is accompanied by its corresponding radiology report. Leveraging CT-RATE, we develop CT-CLIP, a CT-focused contrastive language-image pretraining framework designed for broad applications without the need for task-specific training. We demonstrate how CT-CLIP can be used in two tasks: multi-abnormality detection and case retrieval. Remarkably, in multi-abnormality detection, CT-CLIP outperforms state-of-the-art fully supervised models across all key metrics, effectively eliminating the need for manual annotation. In case retrieval, it efficiently retrieves relevant cases using either image or textual queries, thereby enhancing knowledge dissemination. By combining CT-CLIP’s vision encoder with a pretrained large language model, we create CT-CHAT, a vision-language foundational chat model for 3D chest CT volumes. Finetuned on over 2.7 million question-answer pairs derived from the CT-RATE dataset, CT-CHAT surpasses other multimodal AI assistants, underscoring the necessity for specialized methods in 3D medical imaging. Collectively, the open-source release of CT-RATE, CT-CLIP, and CT-CHAT not only addresses critical challenges in 3D medical imaging, but also lays the groundwork for future innovations in medical AI and improved patient care.
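The abstract gives no implementation details, so the following is only a rough sketch, under stated assumptions, of how a CLIP-style shared embedding space could support the two tasks it names: zero-shot abnormality scoring and case retrieval. The embeddings are toy vectors, and the function names, the present/absent prompt pairing, and the temperature value are illustrative assumptions, not the authors' method or the CT-CLIP API.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def zero_shot_probability(image_emb, present_emb, absent_emb, temperature=0.07):
    """Two-way softmax over similarities to a 'present' and an 'absent' prompt.

    Assumption: each abnormality is scored against one positive and one
    negative text prompt; the temperature of 0.07 is a placeholder value.
    """
    s_pos = cosine(image_emb, present_emb) / temperature
    s_neg = cosine(image_emb, absent_emb) / temperature
    m = max(s_pos, s_neg)  # subtract the max for numerical stability
    e_pos = math.exp(s_pos - m)
    e_neg = math.exp(s_neg - m)
    return e_pos / (e_pos + e_neg)


def retrieve(query_emb, case_embs, top_k=3):
    """Indices of the top_k stored cases, ranked by cosine similarity.

    The query may come from the image encoder or the text encoder, since
    both are assumed to map into the same embedding space.
    """
    order = sorted(
        range(len(case_embs)),
        key=lambda i: cosine(query_emb, case_embs[i]),
        reverse=True,
    )
    return order[:top_k]


# Toy 3-dimensional embeddings standing in for encoder outputs (hypothetical).
volume = [0.9, 0.1, 0.2]
prompt_present = [1.0, 0.0, 0.1]
prompt_absent = [0.0, 1.0, 0.1]
probability = zero_shot_probability(volume, prompt_present, prompt_absent)

cases = [[0.0, 1.0, 0.0], [0.8, 0.2, 0.1], [0.5, 0.5, 0.5]]
ranking = retrieve(volume, cases, top_k=2)
```

In this sketch no classifier is trained per abnormality: only the text prompts change, which is one way to read the abstract's claim of "broad applications without the need for task-specific training".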

arXiv information

Authors	Ibrahim Ethem Hamamci, Sezgin Er, Chenyu Wang, Furkan Almas, Ayse Gulnihan Simsek, Sevval Nil Esirgun, Irem Doga, Omer Faruk Durugol, Weicheng Dai, Murong Xu, Muhammed Furkan Dasdelen, Bastian Wittmann, Tamaz Amiranashvili, Enis Simsar, Mehmet Simsar, Emine Bensu Erdemir, Abdullah Alanbay, Anjany Sekuboyina, Berkan Lafci, Christian Bluethgen, Kayhan Batmanghelich, Mehmet Kemal Ozdemir, Bjoern Menze
Published	2025-04-04 13:02:12+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, DeepL
