MaPPER: Multimodal Prior-guided Parameter Efficient Tuning for Referring Expression Comprehension

Abstract

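Referring Expression Comprehension (REC), which aims to ground a local visual region from a natural-language expression, is a task that relies heavily on multimodal alignment.
Most existing methods use powerful pre-trained models and transfer their visual and linguistic knowledge through full fine-tuning.
However, fully fine-tuning the entire backbone not only destroys the rich prior knowledge acquired during pre-training but also incurs significant computational cost.
Motivated by the recent emergence of Parameter-Efficient Transfer Learning (PETL) methods, we aim to solve the REC task both effectively and efficiently.
Applying these PETL methods directly to REC is inappropriate, because they lack the domain-specific abilities needed for precise local visual perception and vision-language alignment.
We therefore propose MaPPER, a novel framework for Multimodal Prior-guided Parameter Efficient Tuning.
Specifically, MaPPER comprises Dynamic Prior Adapters, which are guided by an aligned prior, and Local Convolution Adapters, which extract precise local semantics for better visual perception.
In addition, a Prior-Guided Text module is proposed to further exploit the prior and facilitate cross-modal alignment.
Experimental results on three widely used benchmarks show that MaPPER achieves the best accuracy compared with full fine-tuning and other PETL methods while tuning only 1.41% of the backbone parameters.
Our code is available at https://github.com/liuting20/MaPPER.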

Abstract (original)

Referring Expression Comprehension (REC), which aims to ground a local visual region via natural language, is a task that heavily relies on multimodal alignment. Most existing methods utilize powerful pre-trained models to transfer visual/linguistic knowledge by full fine-tuning. However, full fine-tuning the entire backbone not only breaks the rich prior knowledge embedded in the pre-training, but also incurs significant computational costs. Motivated by the recent emergence of Parameter-Efficient Transfer Learning (PETL) methods, we aim to solve the REC task in an effective and efficient manner. Directly applying these PETL methods to the REC task is inappropriate, as they lack the specific-domain abilities for precise local visual perception and visual-language alignment. Therefore, we propose a novel framework of Multimodal Prior-guided Parameter Efficient Tuning, namely MaPPER. Specifically, MaPPER comprises Dynamic Prior Adapters guided by an aligned prior, and Local Convolution Adapters to extract precise local semantics for better visual perception. Moreover, the Prior-Guided Text module is proposed to further utilize the prior for facilitating the cross-modal alignment. Experimental results on three widely-used benchmarks demonstrate that MaPPER achieves the best accuracy compared to the full fine-tuning and other PETL methods with only 1.41% tunable backbone parameters. Our code is available at https://github.com/liuting20/MaPPER.
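
To make the "tunable backbone parameters" figure more concrete, the following is a minimal sketch of how such a share could be computed for a generic bottleneck adapter (a down-projection followed by an up-projection, each with a bias). It is not the MaPPER architecture: the function names, the layer sizes, the backbone size, and the choice to divide by the frozen backbone size are all assumptions made for illustration, and the result is not meant to reproduce the 1.41% reported in the abstract.

```python
# Illustrative only: a generic bottleneck adapter, not the MaPPER adapters.
# All sizes below are hypothetical values chosen for the example.

def adapter_params(hidden_dim: int, bottleneck_dim: int) -> int:
    """Count parameters of one bottleneck adapter (down- and up-projection with biases)."""
    down = hidden_dim * bottleneck_dim + bottleneck_dim  # weights + bias
    up = bottleneck_dim * hidden_dim + hidden_dim        # weights + bias
    return down + up


def tunable_fraction(backbone_params: int, tunable_params: int) -> float:
    """Tunable parameters as a share of the frozen backbone size (assumed definition)."""
    return tunable_params / backbone_params


HIDDEN_DIM = 768               # hypothetical transformer width
BOTTLENECK_DIM = 32            # hypothetical adapter bottleneck
NUM_LAYERS = 12                # hypothetical: one adapter per layer
BACKBONE_PARAMS = 86_000_000   # hypothetical frozen backbone size

total_tunable = NUM_LAYERS * adapter_params(HIDDEN_DIM, BOTTLENECK_DIM)
fraction = tunable_fraction(BACKBONE_PARAMS, total_tunable)

print(f"tunable parameters: {total_tunable}")
print(f"share of backbone:  {fraction:.2%}")
```

Under these assumptions the adapters add well under one percent of the backbone size, which is the kind of budget PETL methods target; the exact share depends on the adapter design and on how the denominator is defined.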

arXiv information

Authors	Ting Liu, Zunnan Xu, Yue Hu, Liangtao Shi, Zhiqiang Wang, Quanjun Yin
Published	2025-01-02 15:26:57+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
