AdaViT: Adaptive Vision Transformer for Flexible Pretrain and Finetune with Variable 3D Medical Image Modalities

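Summary

Pretraining techniques, whether supervised or self-supervised, are widely used in deep learning to improve model performance. In real-world clinical scenarios, different sets of magnetic resonance (MR) contrasts are often acquired for different subjects or cases. This poses a challenge for deep learning models that assume consistent input modalities across all cases and between pretraining and fine-tuning. Existing methods struggle to maintain performance when the input modality/contrast set does not match that of the pretrained model, and accuracy often degrades. We propose an adaptive Vision Transformer (AdaViT) framework that can handle a variable set of input modalities for each case. A dynamic tokenizer encodes the different input image modalities into tokens, and the framework exploits the transformer's ability to build attention across a variable number of tokens. Through extensive experiments, we demonstrate that this architecture effectively transfers supervised pretrained models to new datasets with different input modality/contrast sets, yielding superior performance in zero-shot testing, few-shot fine-tuning, and backward transfer on brain infarct and brain tumor segmentation tasks. In addition, for self-supervised pretraining, the proposed method can make maximal use of the available pretraining data and facilitates transfer to diverse downstream tasks with varying input modalities.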

Abstract (original)

Pretrain techniques, whether supervised or self-supervised, are widely used in deep learning to enhance model performance. In real-world clinical scenarios, different sets of magnetic resonance (MR) contrasts are often acquired for different subjects/cases, creating challenges for deep learning models assuming consistent input modalities among all the cases and between pretrain and finetune. Existing methods struggle to maintain performance when there is an input modality/contrast set mismatch with the pretrained model, often resulting in degraded accuracy. We propose an adaptive Vision Transformer (AdaViT) framework capable of handling variable set of input modalities for each case. We utilize a dynamic tokenizer to encode different input image modalities to tokens and take advantage of the characteristics of the transformer to build attention mechanism across variable length of tokens. Through extensive experiments, we demonstrate that this architecture effectively transfers supervised pretrained models to new datasets with different input modality/contrast sets, resulting in superior performance on zero-shot testing, few-shot finetuning, and backward transferring in brain infarct and brain tumor segmentation tasks. Additionally, for self-supervised pretrain, the proposed method is able to maximize the pretrain data and facilitate transferring to diverse downstream tasks with variable sets of input modalities.
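The idea of tokenizing whichever modalities are present and attending over the resulting variable-length sequence can be illustrated with a toy sketch. This is not the authors' implementation: the abstract does not specify the tokenizer design, so the shared patch projection, the per-modality embedding, the patch size, the embedding width, the contrast names (T1, T2, FLAIR, DWI), and the identity query/key/value attention are all assumptions chosen for brevity.

```python
import math
import random

PATCH = 2  # hypothetical cubic patch edge length
DIM = 4    # hypothetical token embedding width


def matvec(vec, mat):
    """Multiply a row vector (len R) by an R x C matrix, giving a len-C list."""
    return [sum(v * row[j] for v, row in zip(vec, mat)) for j in range(len(mat[0]))]


def patches(volume):
    """Yield flattened PATCH^3 blocks of a nested-list volume indexed [D][H][W]."""
    d, h, w = len(volume), len(volume[0]), len(volume[0][0])
    for z in range(0, d, PATCH):
        for y in range(0, h, PATCH):
            for x in range(0, w, PATCH):
                yield [volume[z + dz][y + dy][x + dx]
                       for dz in range(PATCH)
                       for dy in range(PATCH)
                       for dx in range(PATCH)]


class DynamicTokenizer:
    """Assumed design: shared patch projection plus a per-modality embedding."""

    def __init__(self, modalities, seed=0):
        rng = random.Random(seed)
        self.proj = [[rng.uniform(-0.1, 0.1) for _ in range(DIM)]
                     for _ in range(PATCH ** 3)]
        self.modality_embed = {m: [rng.uniform(-0.1, 0.1) for _ in range(DIM)]
                               for m in modalities}

    def __call__(self, case):
        # Only the modalities present in this case contribute tokens,
        # so the sequence length varies from case to case.
        tokens = []
        for name in sorted(case):
            embed = self.modality_embed[name]
            for patch in patches(case[name]):
                projected = matvec(patch, self.proj)
                tokens.append([p + e for p, e in zip(projected, embed)])
        return tokens


def self_attention(tokens):
    """Single-head softmax attention with identity Q/K/V (a simplification)."""
    scale = math.sqrt(len(tokens[0]))
    output = []
    for query in tokens:
        scores = [sum(a * b for a, b in zip(query, key)) / scale for key in tokens]
        peak = max(scores)  # subtract the max for numerical stability
        weights = [math.exp(s - peak) for s in scores]
        total = sum(weights)
        output.append([sum(w * value[j] for w, value in zip(weights, tokens)) / total
                       for j in range(len(query))])
    return output


def constant_volume(value, size=4):
    """Build a size^3 volume filled with one intensity (placeholder data)."""
    return [[[value] * size for _ in range(size)] for _ in range(size)]


tokenizer = DynamicTokenizer(["T1", "T2", "FLAIR", "DWI"])
case_a = {"T1": constant_volume(1.0), "FLAIR": constant_volume(0.5)}
case_b = {"T1": constant_volume(1.0), "T2": constant_volume(0.8),
          "DWI": constant_volume(0.2)}

out_a = self_attention(tokenizer(case_a))
out_b = self_attention(tokenizer(case_b))
print(len(out_a), len(out_b))
```

Each 4x4x4 volume yields 8 patches, so the two-contrast case produces 16 tokens and the three-contrast case produces 24. The same tokenizer and attention function serve both cases without any change to their parameters, which is the property the abstract relies on.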

arXiv information

Authors	Badhan Kumar Das, Gengyan Zhao, Han Liu, Thomas J. Re, Dorin Comaniciu, Eli Gibson, Andreas Maier
Published	2025-04-04 16:57:06+00:00
arXiv page	arxiv_id(pdf)

Source, services used

arxiv.jp, DeepL
