Developing Generalist Foundation Models from a Multimodal Dataset for 3D Computed Tomography

Summary

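Computer vision has achieved tremendous success through multimodal encoding and direct textual interaction with images via chat-based large language models, but comparable progress in medical imaging AI, particularly 3D imaging, has been limited by a shortage of comprehensive datasets.
To address this critical gap, we introduce CT-RATE, the first dataset that pairs 3D medical images with their corresponding text reports.
CT-RATE consists of 25,692 non-contrast 3D chest CT scans from 21,304 unique patients.
Through various reconstructions, these scans are expanded to 50,188 volumes, totaling more than 14.3 million 2D slices.
Each scan is accompanied by its corresponding radiology report.
Leveraging CT-RATE, we develop CT-CLIP, a CT-focused contrastive language-image pretraining framework designed for a broad range of applications without task-specific training.
We show how CT-CLIP can be used in two tasks: multi-abnormality detection and case retrieval.
Remarkably, in multi-abnormality detection, CT-CLIP outperforms state-of-the-art fully supervised models across all key metrics, effectively eliminating the need for manual annotation.
In case retrieval, it efficiently retrieves relevant cases from either image or text queries, enhancing knowledge dissemination.
By combining CT-CLIP's vision encoder with a pretrained large language model, we create CT-CHAT, a vision-language foundational chat model for 3D chest CT volumes.
Fine-tuned on more than 2.7 million question-answer pairs derived from the CT-RATE dataset, CT-CHAT surpasses other multimodal AI assistants, underscoring the need for specialized methods in 3D medical imaging.
Taken together, the open-source release of CT-RATE, CT-CLIP, and CT-CHAT not only addresses critical challenges in 3D medical imaging but also lays the groundwork for future innovation in medical AI and improved patient care.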

Summary (original)

While computer vision has achieved tremendous success with multimodal encoding and direct textual interaction with images via chat-based large language models, similar advancements in medical imaging AI, particularly in 3D imaging, have been limited due to the scarcity of comprehensive datasets. To address this critical gap, we introduce CT-RATE, the first dataset that pairs 3D medical images with corresponding textual reports. CT-RATE comprises 25,692 non-contrast 3D chest CT scans from 21,304 unique patients. Through various reconstructions, these scans are expanded to 50,188 volumes, totaling over 14.3 million 2D slices. Each scan is accompanied by its corresponding radiology report. Leveraging CT-RATE, we develop CT-CLIP, a CT-focused contrastive language-image pretraining framework designed for broad applications without the need for task-specific training. We demonstrate how CT-CLIP can be used in two tasks: multi-abnormality detection and case retrieval. Remarkably, in multi-abnormality detection, CT-CLIP outperforms state-of-the-art fully supervised models across all key metrics, effectively eliminating the need for manual annotation. In case retrieval, it efficiently retrieves relevant cases using either image or textual queries, thereby enhancing knowledge dissemination. By combining CT-CLIP’s vision encoder with a pretrained large language model, we create CT-CHAT, a vision-language foundational chat model for 3D chest CT volumes. Finetuned on over 2.7 million question-answer pairs derived from the CT-RATE dataset, CT-CHAT surpasses other multimodal AI assistants, underscoring the necessity for specialized methods in 3D medical imaging. Collectively, the open-source release of CT-RATE, CT-CLIP, and CT-CHAT not only addresses critical challenges in 3D medical imaging but also lays the groundwork for future innovations in medical AI and improved patient care.
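
The abstract describes CT-CLIP as a contrastive language-image pretraining framework that also supports case retrieval from image or text queries. As a rough illustration of that general idea only, the sketch below computes a generic CLIP-style symmetric contrastive loss over paired embeddings and ranks candidates by cosine similarity. The function names, the toy 2D embeddings, and the temperature of 0.07 are assumptions made for illustration; the abstract does not specify CT-CLIP's actual loss, encoders, or hyperparameters.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def log_softmax(row):
    """Numerically stable log-softmax of a list of logits."""
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]


def clip_style_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric cross-entropy over a batch of paired embeddings.

    Hypothetical helper: row i of image_embs is assumed to be paired
    with row i of text_embs (e.g. a CT volume and its report).
    """
    imgs = [l2_normalize(v) for v in image_embs]
    txts = [l2_normalize(v) for v in text_embs]
    n = len(imgs)
    # logits[i][j]: temperature-scaled cosine similarity of image i and text j
    logits = [
        [sum(a * b for a, b in zip(imgs[i], txts[j])) / temperature for j in range(n)]
        for i in range(n)
    ]
    # Image-to-text direction: row i should assign high probability to column i.
    loss_i2t = -sum(log_softmax(logits[i])[i] for i in range(n)) / n
    # Text-to-image direction: column j should assign high probability to row j.
    columns = [[logits[i][j] for i in range(n)] for j in range(n)]
    loss_t2i = -sum(log_softmax(columns[j])[j] for j in range(n)) / n
    return 0.5 * (loss_i2t + loss_t2i)


def retrieve(query_emb, candidate_embs):
    """Return candidate indices ranked by cosine similarity, best match first."""
    q = l2_normalize(query_emb)
    scores = [
        sum(a * b for a, b in zip(q, l2_normalize(c))) for c in candidate_embs
    ]
    return sorted(range(len(scores)), key=lambda k: scores[k], reverse=True)


# Toy example: correctly paired embeddings give a lower loss than swapped ones.
images = [[1.0, 0.0], [0.0, 1.0]]
texts_matched = [[1.0, 0.0], [0.0, 1.0]]
texts_swapped = [[0.0, 1.0], [1.0, 0.0]]
loss_matched = clip_style_loss(images, texts_matched)
loss_swapped = clip_style_loss(images, texts_swapped)
ranking = retrieve([1.0, 0.1], [[0.0, 1.0], [1.0, 0.0]])
```

In this toy setup the loss is close to zero when each image embedding aligns with its own text embedding and large when the pairs are swapped, which is the signal a contrastive objective of this kind uses to pull matching image-report pairs together. The same similarity scores can then drive retrieval, since a query embedding from either modality is simply compared against the stored candidates.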

arXiv information

Authors	Ibrahim Ethem Hamamci, Sezgin Er, Furkan Almas, Ayse Gulnihan Simsek, Sevval Nil Esirgun, Irem Dogan, Muhammed Furkan Dasdelen, Omer Faruk Durugol, Bastian Wittmann, Tamaz Amiranashvili, Enis Simsar, Mehmet Simsar, Emine Bensu Erdemir, Abdullah Alanbay, Anjany Sekuboyina, Berkan Lafci, Christian Bluethgen, Mehmet Kemal Ozdemir, Bjoern Menze
Published	2024-10-16 12:49:19+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
