Visual Position Prompt for MLLM based Visual Grounding

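Summary

Although Multimodal Large Language Models (MLLMs) excel at a variety of image-related tasks, they struggle to align coordinates precisely with spatial information in images, particularly in position-aware tasks such as visual grounding.
This limitation stems from two key factors.
First, MLLMs lack explicit spatial references, which makes it difficult to associate textual descriptions with precise image locations.
Second, their feature extraction prioritizes global context over fine-grained spatial detail, which weakens localization.
To address this issue, we introduce VPP-LLaVA, an MLLM equipped with a Visual Position Prompt (VPP) to improve its grounding capability.
VPP-LLaVA integrates two complementary mechanisms.
The global VPP overlays learnable, axis-like embeddings onto the input image to provide structured spatial cues.
The local VPP focuses on fine-grained localization by incorporating position-aware queries that suggest probable object locations.
We also introduce the VPP-SFT dataset, with 0.6M samples, which consolidates high-quality visual grounding data into a compact format for efficient model training.
Training on this dataset with VPP improves the model's performance, achieving state-of-the-art results on standard grounding benchmarks despite using fewer training samples than other MLLMs such as MiniGPT-v2, which rely on much larger datasets (~21M samples).
The code and the VPP-SFT dataset will be available at https://github.com/WayneTomas/VPP-LLaVA upon acceptance.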

Summary (original)

Although Multimodal Large Language Models (MLLMs) excel at various image-related tasks, they encounter challenges in precisely aligning coordinates with spatial information within images, particularly in position-aware tasks such as visual grounding. This limitation arises from two key factors. First, MLLMs lack explicit spatial references, making it difficult to associate textual descriptions with precise image locations. Second, their feature extraction processes prioritize global context over fine-grained spatial details, leading to weak localization capability. To address this issue, we introduce VPP-LLaVA, an MLLM equipped with Visual Position Prompt (VPP) to improve its grounding capability. VPP-LLaVA integrates two complementary mechanisms. The global VPP overlays learnable, axis-like embeddings onto the input image to provide structured spatial cues. The local VPP focuses on fine-grained localization by incorporating position-aware queries, which suggests probable object locations. We also introduce a VPP-SFT dataset with 0.6M samples, consolidating high-quality visual grounding data into a compact format for efficient model training. Training on this dataset with VPP enhances the model’s performance, achieving state-of-the-art results on standard grounding benchmarks despite using fewer training samples compared to other MLLMs like MiniGPT-v2, which rely on much larger datasets ($\sim$21M samples). The code and VPP-SFT dataset will be available at https://github.com/WayneTomas/VPP-LLaVA upon acceptance.

arXiv information

Authors	Wei Tang, Yanpeng Sun, Qinying Gu, Zechao Li
Published	2025-03-19 17:08:13+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
