DCFormer: Efficient 3D Vision-Language Modeling with Decomposed Convolutions

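Summary

Vision-language models (VLMs) have been widely applied to 2D medical image analysis because they can align visual and textual representations.
However, extending VLMs to 3D imaging remains computationally challenging.
Existing 3D VLMs often rely on Vision Transformers (ViTs), which are computationally expensive because of the quadratic complexity of self-attention, or on 3D convolutions, which require large numbers of parameters and FLOPs as kernel size increases.
We introduce DCFormer, an efficient 3D image encoder that factorizes 3D convolutions into three parallel 1D convolutions along the depth, height, and width dimensions.
This design preserves spatial information while significantly reducing computational cost.
Integrated into a CLIP-based vision-language framework, DCFormer is trained and evaluated on CT-RATE, a dataset of 50,188 paired 3D chest CT volumes and radiology reports.
In zero-shot and fine-tuned detection of 18 pathologies, as well as in image-text retrieval tasks, DCFormer consistently outperforms state-of-the-art 3D vision encoders, including CT-ViT, ViT, ConvNeXt, PoolFormer, and TransUNet.
These results highlight DCFormer's potential for scalable, clinically deployable 3D medical VLMs.
Our code is available at https://github.com/mirthAI/DCFormer.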
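
To make the cost argument concrete, the following is a minimal sketch of the weight counts behind factorizing a 3D convolution into three 1D convolutions. It is not taken from the DCFormer repository: the function names are hypothetical, the layers are assumed to be dense (every input channel connected to every output channel), and biases are ignored.

```python
def conv3d_weights(kernel_size: int, in_channels: int, out_channels: int) -> int:
    """Weights in a dense k x k x k convolution (assumption: no bias)."""
    return kernel_size ** 3 * in_channels * out_channels


def decomposed_weights(kernel_size: int, in_channels: int, out_channels: int) -> int:
    """Weights in three parallel 1D convolutions, one each along depth,
    height, and width (kernel shapes k x 1 x 1, 1 x k x 1, 1 x 1 x k)."""
    return 3 * kernel_size * in_channels * out_channels


def reduction_factor(kernel_size: int) -> float:
    """Ratio of full 3D weights to decomposed weights: k^3 / (3k) = k^2 / 3."""
    return kernel_size ** 2 / 3


if __name__ == "__main__":
    for k in (3, 5, 7):
        full = conv3d_weights(k, 1, 1)
        split = decomposed_weights(k, 1, 1)
        print(f"k={k}: full={full}, decomposed={split}, ratio={reduction_factor(k):.2f}")
```

Under these assumptions the full kernel grows cubically in the kernel size while the decomposed version grows linearly, so the saving widens as kernels get larger. The multiply-accumulate count per output voxel follows the same ratio.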

Abstract (original)

Vision-language models (VLMs) have been widely applied to 2D medical image analysis due to their ability to align visual and textual representations. However, extending VLMs to 3D imaging remains computationally challenging. Existing 3D VLMs often rely on Vision Transformers (ViTs), which are computationally expensive due to the quadratic complexity of self-attention, or on 3D convolutions, which require large numbers of parameters and FLOPs as kernel size increases. We introduce DCFormer, an efficient 3D image encoder that factorizes 3D convolutions into three parallel 1D convolutions along the depth, height, and width dimensions. This design preserves spatial information while significantly reducing computational cost. Integrated into a CLIP-based vision-language framework, DCFormer is trained and evaluated on CT-RATE, a dataset of 50,188 paired 3D chest CT volumes and radiology reports. In zero-shot and fine-tuned detection of 18 pathologies, as well as in image-text retrieval tasks, DCFormer consistently outperforms state-of-the-art 3D vision encoders, including CT-ViT, ViT, ConvNeXt, PoolFormer, and TransUNet. These results highlight DCFormer’s potential for scalable, clinically deployable 3D medical VLMs. Our code is available at: https://github.com/mirthAI/DCFormer.
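
The quadratic cost of self-attention mentioned in the abstract can be illustrated with a small sketch. The volume and patch sizes below are hypothetical and are not the settings used in the paper; the code only counts tokens and pairwise attention scores for a ViT-style encoder that splits a volume into non-overlapping patches.

```python
def num_tokens(volume_shape: tuple, patch_shape: tuple) -> int:
    """Number of non-overlapping 3D patches (tokens) in a volume."""
    depth, height, width = volume_shape
    pd, ph, pw = patch_shape
    if depth % pd or height % ph or width % pw:
        raise ValueError("volume dimensions must be divisible by the patch size")
    return (depth // pd) * (height // ph) * (width // pw)


def attention_pairs(tokens: int) -> int:
    """Pairwise attention scores per head per layer: N tokens give N^2 scores."""
    return tokens * tokens


if __name__ == "__main__":
    patch = (16, 16, 16)  # hypothetical patch size
    for side in (64, 128, 256):  # hypothetical cubic volumes
        n = num_tokens((side, side, side), patch)
        print(f"{side}^3 volume: {n} tokens, {attention_pairs(n)} attention scores")
```

Doubling each side of the volume multiplies the token count by 8 and the number of attention scores by 64, which is why high-resolution 3D inputs become expensive for attention-based encoders.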

arXiv information

Authors	Gorkem Can Ates, Yu Xin, Kuang Gong, Wei Shao
Published	2025-04-25 16:36:21+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
