Merlin: A Vision Language Foundation Model for 3D Computed Tomography

Abstract

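More than 85 million computed tomography (CT) scans are performed in the US each year, and roughly one quarter of them focus on the abdomen.
Given the current shortage of radiologists, there is strong motivation to use artificial intelligence to reduce the burden of interpreting these complex imaging studies.
Prior state-of-the-art approaches to automated medical image interpretation rely on vision language models (VLMs).
However, current medical VLMs are generally limited to 2D images and short reports, and they do not use electronic health record (EHR) data for supervision.
Merlin is a 3D VLM trained on paired CT scans (more than 6 million images from 15,331 CTs), EHR diagnosis codes (more than 1.8 million codes), and radiology reports (more than 6 million tokens).
We evaluate Merlin on 6 task types and 752 individual tasks.
The non-adapted (off-the-shelf) tasks include zero-shot findings classification (31 findings), phenotype classification (692 phenotypes), and zero-shot cross-modal retrieval (image to findings and image to impressions).
The adapted tasks include 5-year disease prediction (6 diseases), radiology report generation, and 3D semantic segmentation (20 organs).
Internal validation uses a test set of 5,137 CTs; external validation uses 7,000 clinical CTs and two public CT datasets (VerSe and TotalSegmentator).
Beyond these clinically relevant evaluations, we assess various network architectures and training strategies and show that Merlin performs favorably against existing task-specific baselines.
We derive data scaling laws to estimate empirically how much training data a required level of downstream task performance demands.
Furthermore, unlike conventional VLMs that require hundreds of GPUs for training, all training is performed on a single GPU.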
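
The abstract gives the total of 752 individual tasks without a breakdown. The per-type counts it does state can be tallied as in the short sketch below. Counting cross-modal retrieval as 2 tasks (image to findings, image to impressions) and report generation as 1 task is an assumption made here for illustration; it is one reading that reproduces the stated total, not a breakdown confirmed by the abstract. The names `TASKS` and `tally` are hypothetical.

```python
# Per-type task counts. The values 31, 692, 6, and 20 are stated in the
# abstract; the values 2 (retrieval) and 1 (report generation) are
# ASSUMPTIONS chosen for illustration.
TASKS = {
    "zero-shot findings classification": 31,
    "phenotype classification": 692,
    "zero-shot cross-modal retrieval": 2,   # assumed: image->findings, image->impressions
    "5-year disease prediction": 6,
    "radiology report generation": 1,       # assumed: counted as a single task
    "3D semantic segmentation": 20,
}


def tally(tasks):
    """Return (number of task types, total number of individual tasks)."""
    return len(tasks), sum(tasks.values())


if __name__ == "__main__":
    n_types, n_tasks = tally(TASKS)
    print(n_types, n_tasks)  # prints: 6 752
```

Under these assumptions the tally matches the 6 task types and 752 individual tasks reported in the abstract.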

Abstract (original)

Over 85 million computed tomography (CT) scans are performed annually in the US, of which approximately one quarter focus on the abdomen. Given the current radiologist shortage, there is a large impetus to use artificial intelligence to alleviate the burden of interpreting these complex imaging studies. Prior state-of-the-art approaches for automated medical image interpretation leverage vision language models (VLMs). However, current medical VLMs are generally limited to 2D images and short reports, and do not leverage electronic health record (EHR) data for supervision. We introduce Merlin – a 3D VLM that we train using paired CT scans (6+ million images from 15,331 CTs), EHR diagnosis codes (1.8+ million codes), and radiology reports (6+ million tokens). We evaluate Merlin on 6 task types and 752 individual tasks. The non-adapted (off-the-shelf) tasks include zero-shot findings classification (31 findings), phenotype classification (692 phenotypes), and zero-shot cross-modal retrieval (image to findings and image to impressions), while model adapted tasks include 5-year disease prediction (6 diseases), radiology report generation, and 3D semantic segmentation (20 organs). We perform internal validation on a test set of 5,137 CTs, and external validation on 7,000 clinical CTs and on two public CT datasets (VerSe, TotalSegmentator). Beyond these clinically-relevant evaluations, we assess the efficacy of various network architectures and training strategies to depict that Merlin has favorable performance to existing task-specific baselines. We derive data scaling laws to empirically assess training data needs for requisite downstream task performance. Furthermore, unlike conventional VLMs that require hundreds of GPUs for training, we perform all training on a single GPU.
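
The abstract does not describe how the zero-shot tasks are computed. A common approach for contrastively trained vision language models, shown here only as a minimal sketch and not as Merlin's actual implementation, is to embed the image and candidate text prompts in a shared space and compare them by cosine similarity. The functions `cosine`, `zero_shot_present`, and `retrieve` are hypothetical, and the vectors are made-up toy values standing in for encoder outputs, not model results.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def zero_shot_present(image_emb, present_emb, absent_emb):
    """Classify a finding as present if the image embedding is closer to
    the 'finding present' prompt than to the 'finding absent' prompt."""
    return cosine(image_emb, present_emb) > cosine(image_emb, absent_emb)


def retrieve(image_emb, text_embs):
    """Return the index of the text embedding most similar to the image."""
    return max(range(len(text_embs)),
               key=lambda i: cosine(image_emb, text_embs[i]))


if __name__ == "__main__":
    # Toy 3-dimensional embeddings (illustrative values only).
    image = [0.9, 0.1, 0.2]
    prompt_present = [1.0, 0.0, 0.1]
    prompt_absent = [0.0, 1.0, 0.1]

    print(zero_shot_present(image, prompt_present, prompt_absent))
    print(retrieve(image, [prompt_absent, prompt_present]))
```

In this framing, zero-shot classification and cross-modal retrieval share the same similarity computation: classification compares an image against a pair of prompts, while retrieval ranks an image against a pool of report texts.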

arXiv information

Authors	Louis Blankemeier, Joseph Paul Cohen, Ashwin Kumar, Dave Van Veen, Syed Jamal Safdar Gardezi, Magdalini Paschali, Zhihong Chen, Jean-Benoit Delbrouck, Eduardo Reis, Cesar Truyts, Christian Bluethgen, Malte Engmann Kjeldskov Jensen, Sophie Ostmeier, Maya Varma, Jeya Maria Jose Valanarasu, Zhongnan Fang, Zepeng Huo, Zaid Nabulsi, Diego Ardila, Wei-Hung Weng, Edson Amaro Junior, Neera Ahuja, Jason Fries, Nigam H. Shah, Andrew Johnston, Robert D. Boutin, Andrew Wentland, Curtis P. Langlotz, Jason Hom, Sergios Gatidis, Akshay S. Chaudhari
Published	2024-06-10 17:53:01+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
