Masked Vision and Language Pre-training with Unimodal and Multimodal Contrastive Losses for Medical Visual Question Answering

Abstract

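Medical visual question answering (VQA) is a challenging task in which a model must answer a clinical question about a given medical image by considering both visual and language information.
Because training data for medical VQA is small in scale, the pre-training and fine-tuning paradigm is commonly used to improve model generalization.
This paper presents a new self-supervised approach that learns unimodal and multimodal feature representations of input images and text from medical image-caption datasets, using both unimodal and multimodal contrastive losses together with masked language modeling and image-text matching as pre-training objectives.
The pre-trained model is then transferred to downstream medical VQA tasks.
The proposed approach achieves state-of-the-art (SOTA) performance on three publicly available medical VQA datasets, with accuracy improvements of 2.2%, 14.7%, and 1.7%, respectively.
The authors also conduct a comprehensive analysis to validate the effectiveness of the approach's individual components and to study different pre-training settings.
Code and models are available at https://github.com/pengfeiliHEU/MUMC.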
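
As a rough illustration of what an image-text (multimodal) contrastive objective computes, the sketch below implements a generic symmetric InfoNCE-style loss over a batch of paired embeddings. It is not taken from the MUMC repository: the function names, the temperature value of 0.07, and the use of plain Python lists are all assumptions made for this example, and the paper's actual losses may differ in detail.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def log_softmax(row):
    """Numerically stable log-softmax over one row of similarity scores."""
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]

def contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric InfoNCE-style loss (hypothetical helper, not the authors' code).

    image_embs[k] and text_embs[k] are assumed to be a matching pair;
    every other combination in the batch is treated as a negative.
    """
    imgs = [l2_normalize(v) for v in image_embs]
    txts = [l2_normalize(v) for v in text_embs]
    n = len(imgs)
    # Temperature-scaled cosine similarity between every image and every text.
    sim = [[sum(a * b for a, b in zip(i, t)) / temperature for t in txts]
           for i in imgs]
    # Image-to-text direction: each image should pick out its own caption.
    i2t = -sum(log_softmax(sim[k])[k] for k in range(n)) / n
    # Text-to-image direction: each caption should pick out its own image.
    sim_t = [list(col) for col in zip(*sim)]
    t2i = -sum(log_softmax(sim_t[k])[k] for k in range(n)) / n
    return 0.5 * (i2t + t2i)

# Toy batch: two orthogonal "image" vectors paired with matching "text" vectors.
aligned = contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
# Same vectors with the captions swapped, so every pair is mismatched.
swapped = contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
```

In this toy batch the correctly paired embeddings give a loss close to zero, while the swapped captions give a much larger one, which is the behavior a contrastive pre-training objective rewards.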

Abstract (original)

Medical visual question answering (VQA) is a challenging task that requires answering clinical questions of a given medical image, by taking consider of both visual and language information. However, due to the small scale of training data for medical VQA, pre-training fine-tuning paradigms have been a commonly used solution to improve model generalization performance. In this paper, we present a novel self-supervised approach that learns unimodal and multimodal feature representations of input images and text using medical image caption datasets, by leveraging both unimodal and multimodal contrastive losses, along with masked language modeling and image text matching as pretraining objectives. The pre-trained model is then transferred to downstream medical VQA tasks. The proposed approach achieves state-of-the-art (SOTA) performance on three publicly available medical VQA datasets with significant accuracy improvements of 2.2%, 14.7%, and 1.7% respectively. Besides, we conduct a comprehensive analysis to validate the effectiveness of different components of the approach and study different pre-training settings. Our codes and models are available at https://github.com/pengfeiliHEU/MUMC.

arXiv information

Authors	Pengfei Li, Gang Liu, Jinlong He, Zixu Zhao, Shenjun Zhong
Published	2023-07-11 15:00:11+00:00
arXiv site	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google
