Bridging Modality Gap for Visual Grounding with Effective Cross-modal Distillation

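Summary

Visual grounding aims to match the visual information in a specific region of an image with the corresponding natural language expression.
Current visual grounding methods use separately pre-trained visual and language backbones to obtain visual and linguistic features.
These two kinds of features are fused through carefully designed networks, but because the features are heterogeneous, they are poorly suited to multimodal reasoning.
The problem stems from the domain gap between the single-modal pre-trained backbones used in current visual grounding methods, a gap that conventional end-to-end training can hardly overcome.
To alleviate this, our work proposes the Empowering pre-trained model for Visual Grounding (EpmVG) framework, which distills a multimodal pre-trained model to guide the visual grounding task.
EpmVG is based on a novel cross-modal distillation mechanism that effectively introduces the image-text consistency information held in the pre-trained model, reducing the domain gap in the backbone networks and thereby improving the model's performance on the visual grounding task.
Extensive experiments on five commonly used datasets show that our method outperforms state-of-the-art methods.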

Summary (original)

Visual grounding aims to align visual information of specific regions of images with corresponding natural language expressions. Current visual grounding methods leverage pre-trained visual and language backbones separately to obtain visual features and linguistic features. Although these two types of features are then fused via delicately designed networks, the heterogeneity of the features makes them inapplicable for multi-modal reasoning. This problem arises from the domain gap between the single-modal pre-training backbone used in current visual grounding methods, which can hardly be overcome by the traditional end-to-end training method. To alleviate this, our work proposes an Empowering pre-trained model for Visual Grounding (EpmVG) framework, which distills a multimodal pre-trained model to guide the visual grounding task. EpmVG is based on a novel cross-modal distillation mechanism, which can effectively introduce the consistency information of images and texts in the pre-trained model, to reduce the domain gap existing in the backbone networks, thereby improving the performance of the model in the visual grounding task. Extensive experiments are carried out on five conventionally used datasets, and results demonstrate that our method achieves better performance than state-of-the-art methods.
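
The abstract does not state the form of the distillation objective, so the following is only a minimal sketch of a generic feature-level distillation loss, not the EpmVG method itself. It assumes that each single-modal backbone (the student) produces feature vectors that are pulled toward the matching vectors from a multimodal pre-trained model (the teacher) using a cosine distance, and that this term is added to the grounding loss with a weight. The function names, the cosine distance, and the `weight` parameter are all hypothetical choices made for illustration.

```python
import math


def cosine_similarity(u, v, eps=1e-12):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    # eps guards against division by zero for all-zero vectors.
    return dot / max(norm_u * norm_v, eps)


def distillation_loss(student_feats, teacher_feats):
    """Mean cosine distance (1 - cosine similarity) over paired vectors.

    Assumption: student and teacher features are already projected to
    the same dimension and paired one-to-one.
    """
    if len(student_feats) != len(teacher_feats) or not student_feats:
        raise ValueError("expected two non-empty, equally sized feature lists")
    distances = [
        1.0 - cosine_similarity(s, t)
        for s, t in zip(student_feats, teacher_feats)
    ]
    return sum(distances) / len(distances)


def total_loss(grounding_loss, visual_distill, text_distill, weight=1.0):
    """Grounding loss plus weighted distillation terms (hypothetical weighting)."""
    return grounding_loss + weight * (visual_distill + text_distill)


if __name__ == "__main__":
    # Toy features: the student's visual feature is orthogonal to the teacher's.
    visual = distillation_loss([[1.0, 0.0]], [[0.0, 1.0]])
    # The student's text feature already points in the teacher's direction.
    text = distillation_loss([[1.0, 0.0]], [[2.0, 0.0]])
    print(total_loss(0.5, visual, text, weight=2.0))
```

In this sketch a student feature that already points in the teacher's direction contributes no distillation loss, so the extra term only penalizes features that disagree with the multimodal teacher.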

arXiv information

Authors	Jiaxi Wang, Wenhui Hu, Xueyang Liu, Beihu Wu, Yuting Qiu, YingYing Cai
Published	2023-12-29 15:32:11+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
