Design as Desired: Utilizing Visual Question Answering for Multimodal Pre-training

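Summary

Multimodal pre-training has shown promise in the medical domain, where it learns medical visual representations from paired medical reports.
However, many pre-training tasks require additional annotations from clinicians, and most cannot explicitly guide the model to learn the desired features of different pathologies.
In this paper, we use Visual Question Answering (VQA) for multimodal pre-training to guide the framework to focus on targeted pathological features.
We leverage descriptions in medical reports to design multi-granular question-answer pairs associated with different diseases, which support pre-training of the framework without requiring additional expert annotations.
We also propose a novel pre-training framework with a quasi-textual feature transformer, a module designed to transform visual features into a quasi-textual space closer to the textual domain through a contrastive learning strategy.
This narrows the vision-language gap and facilitates modality alignment.
Our framework is applied to four downstream tasks (report generation, classification, segmentation, and detection) across five datasets.
Extensive experiments demonstrate the superiority of our framework over other state-of-the-art methods.
Our code is available at https://github.com/MoramiSu/QFT-MICCAI2024.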
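
The summary mentions a contrastive learning strategy that pulls transformed visual features toward the textual domain, but it does not state the exact loss. As a rough illustration only, the sketch below implements a generic symmetric InfoNCE loss, a common choice for this kind of alignment; it is not taken from the paper or its repository. The function name `symmetric_info_nce`, the argument names, and the temperature value are all assumptions made for this example.

```python
import math


def l2_normalize(vec):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def log_softmax(row):
    """Numerically stable log-softmax over one row of logits."""
    m = max(row)
    log_sum = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - log_sum for x in row]


def symmetric_info_nce(quasi_text_feats, text_feats, temperature=0.07):
    """Generic symmetric InfoNCE loss (hypothetical helper, not the paper's code).

    Row i of each input is assumed to come from the same image-report pair,
    so the matching pairs sit on the diagonal of the similarity matrix.
    """
    q = [l2_normalize(v) for v in quasi_text_feats]
    t = [l2_normalize(v) for v in text_feats]
    n = len(q)
    # Cosine similarities divided by the temperature.
    logits = [[sum(a * b for a, b in zip(qi, tj)) / temperature for tj in t]
              for qi in q]
    # Transpose to score each text against all quasi-textual features.
    logits_t = [list(col) for col in zip(*logits)]
    loss_q2t = -sum(log_softmax(logits[i])[i] for i in range(n)) / n
    loss_t2q = -sum(log_softmax(logits_t[i])[i] for i in range(n)) / n
    return 0.5 * (loss_q2t + loss_t2q)


# Toy check: matched pairs should give a lower loss than mismatched pairs.
aligned = symmetric_info_nce([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
swapped = symmetric_info_nce([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
print(aligned < swapped)
```

Minimizing a loss of this shape rewards feature pairs from the same image and report for being more similar than pairs from different samples, which is one way a vision-language gap can be narrowed.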

Summary (original)

Multimodal pre-training demonstrates its potential in the medical domain, which learns medical visual representations from paired medical reports. However, many pre-training tasks require extra annotations from clinicians, and most of them fail to explicitly guide the model to learn the desired features of different pathologies. In this paper, we utilize Visual Question Answering (VQA) for multimodal pre-training to guide the framework focusing on targeted pathological features. We leverage descriptions in medical reports to design multi-granular question-answer pairs associated with different diseases, which assist the framework in pre-training without requiring extra annotations from experts. We also propose a novel pre-training framework with a quasi-textual feature transformer, a module designed to transform visual features into a quasi-textual space closer to the textual domain via a contrastive learning strategy. This narrows the vision-language gap and facilitates modality alignment. Our framework is applied to four downstream tasks: report generation, classification, segmentation, and detection across five datasets. Extensive experiments demonstrate the superiority of our framework compared to other state-of-the-art methods. Our code is available at https://github.com/MoramiSu/QFT-MICCAI2024.

arXiv information

Authors	Tongkun Su, Jun Li, Xi Zhang, Haibo Jin, Hao Chen, Qiong Wang, Faqin Lv, Baoliang Zhao, Yin Hu
Published	2024-10-01 13:36:38+00:00
arXiv site	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google
