MMCLIP: Cross-modal Attention Masked Modelling for Medical Language-Image Pre-Training

Summary

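Vision-and-language pre-training (VLP) in the medical field uses contrastive learning on image-text pairs to transfer effectively across tasks.
However, current VLP approaches that use a masked modeling strategy face two challenges when applied to the medical domain.
First, because medical data is scarce, current models struggle to reconstruct key pathological features accurately.
Second, most methods use either paired image-text data or image-only data, and fail to exploit paired and unpaired data together.
To address this, the paper proposes the MMCLIP (Masked Medical Contrastive Language-Image Pre-Training) framework, which strengthens pathological learning and feature learning from unpaired data.
First, it introduces an attention-masked image modeling module (AttMIM) and an entity-driven masked language modeling module (EntMLM), which learn to reconstruct pathological visual and textual tokens through multi-modal feature interaction, improving medically enhanced features.
The AttMIM module masks the portion of the image features that responds most strongly to the textual features.
This allows MMCLIP to reconstruct highly similar medical image data more efficiently.
Second, MMCLIP capitalizes on unpaired data to enhance multimodal learning by introducing disease-kind prompts.
Experimental results show that MMCLIP achieves state-of-the-art (SOTA) zero-shot and fine-tuning classification performance on five datasets.
The code will be available at https://github.com/AIGeeksGroup/MMCLIP.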
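
The text-guided masking idea behind AttMIM can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not say how responsiveness to the text is measured, so the sketch assumes cosine similarity between each image patch feature and a single pooled text feature, and the names `cosine`, `text_guided_mask`, and `mask_ratio` are hypothetical. It only shows how the patches most aligned with the text could be selected for masking, so that a model would then have to reconstruct them.

```python
import math
from typing import List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def text_guided_mask(
    patch_feats: Sequence[Sequence[float]],
    text_feat: Sequence[float],
    mask_ratio: float = 0.5,
) -> List[bool]:
    """Return one flag per patch; True marks a patch to be masked.

    Assumption: "responsiveness" to the text is approximated by cosine
    similarity with a pooled text feature. The paper may use a different
    cross-modal attention score.
    """
    if not 0.0 <= mask_ratio <= 1.0:
        raise ValueError("mask_ratio must be in [0, 1]")
    scores = [cosine(p, text_feat) for p in patch_feats]
    num_masked = int(mask_ratio * len(scores))
    # Rank patch indices from most to least text-responsive.
    ranked = sorted(range(len(scores)), key=lambda i: -scores[i])
    chosen = set(ranked[:num_masked])
    return [i in chosen for i in range(len(scores))]


# Toy example: four 2-D patch features and one text feature.
patches = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7], [-1.0, 0.0]]
text = [1.0, 0.0]
mask = text_guided_mask(patches, text, mask_ratio=0.5)
print(mask)  # [True, False, True, False]
```

In the toy example the first and third patches point most nearly in the direction of the text feature, so they are the ones hidden. Random masking would instead hide patches regardless of their relevance to the report text.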

Abstract (original)

Vision-and-language pretraining (VLP) in the medical field utilizes contrastive learning on image-text pairs to achieve effective transfer across tasks. Yet, current VLP approaches with the masked modeling strategy face two challenges when applied to the medical domain. First, current models struggle to accurately reconstruct key pathological features due to the scarcity of medical data. Second, most methods only adopt either paired image-text or image-only data, failing to exploit the combination of both paired and unpaired data. To this end, this paper proposes the MMCLIP (Masked Medical Contrastive Language-Image Pre-Training) framework to enhance pathological learning and feature learning via unpaired data. First, we introduce the attention-masked image modeling (AttMIM) and entity-driven masked language modeling module (EntMLM), which learns to reconstruct pathological visual and textual tokens via multi-modal feature interaction, thus improving medical-enhanced features. The AttMIM module masks a portion of the image features that are highly responsive to textual features. This allows MMCLIP to improve the reconstruction of highly similar image data in medicine efficiency. Second, our MMCLIP capitalizes unpaired data to enhance multimodal learning by introducing disease-kind prompts. The experimental results show that MMCLIP achieves SOTA for zero-shot and fine-tuning classification performance on five datasets. Our code will be available at https://github.com/AIGeeksGroup/MMCLIP.

arXiv information

Authors	Biao Wu,Yutong Xie,Zeyu Zhang,Minh Hieu Phan,Qi Chen,Ling Chen,Qi Wu
Published	2025-04-16 16:00:52+00:00
arXiv page	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
