HiVG: Hierarchical Multimodal Fine-grained Modulation for Visual Grounding

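Summary

Visual grounding, which aims to ground a visual region via natural language, is a task that relies heavily on cross-modal alignment.
Existing work used uni-modal pre-trained models to transfer visual or linguistic knowledge separately, ignoring the correspondence between the two modalities.
Motivated by recent advances in contrastive language-image pre-training and low-rank adaptation (LoRA) methods, we aim to solve the grounding task on the basis of multimodal pre-training.
However, there are significant task gaps between pre-training and grounding.
To address these gaps, we propose a concise and efficient hierarchical multimodal fine-grained modulation framework, namely HiVG.
Specifically, HiVG consists of a multi-layer adaptive cross-modal bridge and a hierarchical multimodal low-rank adaptation (HiLoRA) paradigm.
The cross-modal bridge addresses the inconsistency between the visual features and those required for grounding, and establishes a connection between multi-level visual features and text features.
HiLoRA prevents the accumulation of perceptual errors by adapting the cross-modal features hierarchically, from shallow layers to deep layers.
Experimental results on five datasets demonstrate the effectiveness of our approach and show significant grounding capabilities as well as promising energy-efficiency advantages.
Project page: https://github.com/linhuixiao/HiVG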

Summary (original)

Visual grounding, which aims to ground a visual region via natural language, is a task that heavily relies on cross-modal alignment. Existing works utilized uni-modal pre-trained models to transfer visual or linguistic knowledge separately while ignoring the multimodal corresponding information. Motivated by recent advancements in contrastive language-image pre-training and low-rank adaptation (LoRA) methods, we aim to solve the grounding task based on multimodal pre-training. However, there exists significant task gaps between pre-training and grounding. Therefore, to address these gaps, we propose a concise and efficient hierarchical multimodal fine-grained modulation framework, namely HiVG. Specifically, HiVG consists of a multi-layer adaptive cross-modal bridge and a hierarchical multimodal low-rank adaptation (HiLoRA) paradigm. The cross-modal bridge can address the inconsistency between visual features and those required for grounding, and establish a connection between multi-level visual and text features. HiLoRA prevents the accumulation of perceptual errors by adapting the cross-modal features from shallow to deep layers in a hierarchical manner. Experimental results on five datasets demonstrate the effectiveness of our approach and showcase the significant grounding capabilities as well as promising energy efficiency advantages. The project page: https://github.com/linhuixiao/HiVG.
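
As a rough illustration of the low-rank adaptation idea the abstract builds on, the sketch below merges a standard LoRA update, W + (alpha / r) * B A, into a frozen weight matrix, and then lists which layers would be adapted at each stage of a shallow-to-deep schedule. This is not the authors' implementation: the function names (`matmul`, `lora_weight`, `stage_layers`) are hypothetical, and the staging rule (stage k adapts the first k equal-sized groups of layers) is only one assumed reading of "from shallow to deep layers in a hierarchical manner".

```python
# Minimal LoRA sketch in pure Python (no external dependencies).
# All function names are hypothetical and chosen for illustration only.

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]

def lora_weight(w, a, b, alpha, rank):
    """Return W + (alpha / rank) * (B @ A), the merged LoRA weight.

    w: frozen weight, shape (out, in)
    a: trainable down-projection, shape (rank, in)
    b: trainable up-projection, shape (out, rank)
    """
    delta = matmul(b, a)          # low-rank update, shape (out, in)
    scale = alpha / rank
    return [[w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
            for w_row, d_row in zip(w, delta)]

def stage_layers(num_layers, num_stages):
    """ASSUMED schedule: stage k adapts the first k groups of layers,
    so adaptation proceeds from shallow layers to deep layers."""
    group = num_layers // num_stages
    return [list(range(group * (k + 1))) for k in range(num_stages)]

# Toy example: 2x2 identity weight with a rank-1 update.
W = [[1.0, 0.0], [0.0, 1.0]]
A = [[1.0, 1.0]]        # (rank=1, in=2)
B = [[1.0], [2.0]]      # (out=2, rank=1)
merged = lora_weight(W, A, B, alpha=2.0, rank=1)
schedule = stage_layers(num_layers=12, num_stages=3)
```

With B initialized to zeros, as is common for LoRA, `lora_weight` returns the frozen weight unchanged, so training starts from the pre-trained model's behavior.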

arXiv information

Authors	Linhui Xiao, Xiaoshan Yang, Fang Peng, Yaowei Wang, Changsheng Xu
Published	2024-09-05 14:33:04+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
