DCFormer: Efficient 3D Vision-Language Modeling with Decomposed Convolutions

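Abstract

Vision-language models (VLMs) align visual and textual representations, enabling high-performance zero-shot classification and image-text retrieval in 2D medical imaging.
However, extending VLMs to 3D medical imaging remains computationally challenging.
Existing 3D VLMs rely either on Vision Transformers (ViTs), which are computationally expensive because of the quadratic complexity of self-attention, or on 3D convolutions, which require excessive parameters and FLOPs as the kernel size grows.
We introduce DCFormer, an efficient 3D medical image encoder that factorizes 3D convolutions into three parallel 1D convolutions along depth, height, and width.
This design preserves spatial information while significantly reducing computational cost.
Integrated into a CLIP-based vision-language framework, DCFormer is evaluated on CT-RATE, a dataset of 50,188 paired 3D chest CT volumes and radiology reports, for zero-shot multi-abnormality detection across 18 pathologies.
Compared with ViT, ConvNeXt, PoolFormer, and TransUNet, DCFormer achieves superior efficiency and accuracy, with DCFormer-Tiny reaching 62.0% accuracy and a 46.3% F1-score while using significantly fewer parameters.
These results highlight DCFormer's potential for scalable, clinically deployable 3D medical VLMs.
Our code will be made publicly available.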

Abstract (original)

Vision-language models (VLMs) align visual and textual representations, enabling high-performance zero-shot classification and image-text retrieval in 2D medical imaging. However, extending VLMs to 3D medical imaging remains computationally challenging. Existing 3D VLMs rely on Vision Transformers (ViTs), which are computationally expensive due to self-attention’s quadratic complexity, or 3D convolutions, which demand excessive parameters and FLOPs as kernel size increases. We introduce DCFormer, an efficient 3D medical image encoder that factorizes 3D convolutions into three parallel 1D convolutions along depth, height, and width. This design preserves spatial information while significantly reducing computational cost. Integrated into a CLIP-based vision-language framework, DCFormer is evaluated on CT-RATE, a dataset of 50,188 paired 3D chest CT volumes and radiology reports, for zero-shot multi-abnormality detection across 18 pathologies. Compared to ViT, ConvNeXt, PoolFormer, and TransUNet, DCFormer achieves superior efficiency and accuracy, with DCFormer-Tiny reaching 62.0% accuracy and a 46.3% F1-score while using significantly fewer parameters. These results highlight DCFormer’s potential for scalable, clinically deployable 3D medical VLMs. Our codes will be publicly available.
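The factorization described in the abstract can be illustrated with a small, dependency-free sketch. It compares the per-channel weight count of a dense k x k x k kernel with that of three 1D kernels of length k, and applies three parallel 1D convolutions to a toy volume. Several details here are assumptions, not taken from the paper: the branches are combined by summation, zero padding keeps the output the same size as the input, kernels have odd length, and the function names (`dense_kernel_weights`, `decomposed_kernel_weights`, `conv1d_along_axis`, `decomposed_conv3d`) are hypothetical. The abstract states only that the three 1D convolutions run in parallel along depth, height, and width.

```python
from typing import List

Volume = List[List[List[float]]]  # indexed as volume[depth][height][width]


def dense_kernel_weights(k: int) -> int:
    """Weights in one dense k x k x k kernel (single channel)."""
    return k ** 3


def decomposed_kernel_weights(k: int) -> int:
    """Weights in three 1D kernels of length k (single channel)."""
    return 3 * k


def conv1d_along_axis(vol: Volume, kernel: List[float], axis: int) -> Volume:
    """Zero-padded, same-size 1D cross-correlation along one axis.

    axis 0 = depth, 1 = height, 2 = width. Assumes an odd kernel length.
    """
    dims = (len(vol), len(vol[0]), len(vol[0][0]))
    radius = len(kernel) // 2
    out = [[[0.0] * dims[2] for _ in range(dims[1])] for _ in range(dims[0])]
    for d in range(dims[0]):
        for h in range(dims[1]):
            for w in range(dims[2]):
                acc = 0.0
                for j, weight in enumerate(kernel):
                    src = [d, h, w]
                    src[axis] += j - radius
                    # Positions outside the volume contribute zero.
                    if 0 <= src[axis] < dims[axis]:
                        acc += weight * vol[src[0]][src[1]][src[2]]
                out[d][h][w] = acc
    return out


def decomposed_conv3d(vol: Volume, k_depth: List[float],
                      k_height: List[float], k_width: List[float]) -> Volume:
    """Three parallel 1D convolutions, summed (summation is an assumption)."""
    parts = [conv1d_along_axis(vol, k, axis)
             for axis, k in enumerate((k_depth, k_height, k_width))]
    dims = (len(vol), len(vol[0]), len(vol[0][0]))
    return [[[parts[0][d][h][w] + parts[1][d][h][w] + parts[2][d][h][w]
              for w in range(dims[2])]
             for h in range(dims[1])]
            for d in range(dims[0])]


if __name__ == "__main__":
    for k in (3, 7, 11):
        print(k, dense_kernel_weights(k), decomposed_kernel_weights(k))
```

Under these assumptions the dense kernel grows cubically with k while the decomposed form grows linearly, which is consistent with the abstract's point that 3D convolutions become expensive as kernel size increases.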

arXiv information

Authors	Gorkem Can Ates, Kuang Gong, Wei Shao
Published	2025-02-07 17:10:22+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
