Language-Guided Diffusion Model for Visual Grounding

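Summary

Visual grounding (VG) tasks require explicit cross-modal alignment, because the image regions that semantically correspond to the given language phrases must be located.
Existing approaches complete this visual-text reasoning in a single step.
To perform well, they depend on large numbers of anchors and on over-designed multimodal fusion modules built on human priors, which leads to complicated frameworks that can be hard to train and prone to overfitting to specific scenarios.
Worse still, such one-shot reasoning mechanisms cannot refine boxes continuously to improve query-region matching.
In contrast, in this paper we formulate an iterative reasoning process using denoising diffusion modeling.
Specifically, we propose LG-DVG, a language-guided diffusion framework for visual grounding, which trains the model to reason progressively toward the queried object boxes by denoising a set of noisy boxes under language guidance.
To achieve this, LG-DVG gradually perturbs query-aligned ground-truth boxes into noisy ones and then reverses this process step by step, conditioned on query semantics.
Extensive experiments with the proposed framework on five widely used datasets validate the superior performance of solving visual grounding, a cross-modal alignment task, in a generative way.
The source code is available at https://github.com/iQua/vgbase/tree/main/examples/DiffusionVG.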

Abstract (original)

Visual grounding (VG) tasks involve explicit cross-modal alignment, as semantically corresponding image regions are to be located for the language phrases provided. Existing approaches complete such visual-text reasoning in a single-step manner. Their performance causes high demands on large-scale anchors and over-designed multi-modal fusion modules based on human priors, leading to complicated frameworks that may be difficult to train and overfit to specific scenarios. Even worse, such once-for-all reasoning mechanisms are incapable of refining boxes continuously to enhance query-region matching. In contrast, in this paper, we formulate an iterative reasoning process by denoising diffusion modeling. Specifically, we propose a language-guided diffusion framework for visual grounding, LG-DVG, which trains the model to progressively reason queried object boxes by denoising a set of noisy boxes with the language guide. To achieve this, LG-DVG gradually perturbs query-aligned ground truth boxes to noisy ones and reverses this process step by step, conditional on query semantics. Extensive experiments for our proposed framework on five widely used datasets validate the superior performance of solving visual grounding, a cross-modal alignment task, in a generative way. The source codes are available at https://github.com/iQua/vgbase/tree/main/examples/DiffusionVG.
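The perturb-then-reverse idea described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: it assumes the standard Gaussian forward process used in denoising diffusion models, a cosine noise schedule, a deterministic DDIM-style reverse update, and boxes given as normalized (cx, cy, w, h) values. The names cosine_alpha_bar, perturb_box, and refine are hypothetical, and the trained language-conditioned network is replaced by a caller-supplied predict_x0 function.

```python
import math
import random


def cosine_alpha_bar(t, num_steps, s=0.008):
    """Cumulative signal rate at step t under an assumed cosine schedule."""
    def f(u):
        return math.cos((u / num_steps + s) / (1 + s) * math.pi / 2) ** 2
    return f(t) / f(0)


def perturb_box(box, t, num_steps, rng):
    """Forward process: sample x_t ~ N(sqrt(a_t) * x_0, (1 - a_t) * I)."""
    a = cosine_alpha_bar(t, num_steps)
    return [math.sqrt(a) * c + math.sqrt(1 - a) * rng.gauss(0, 1) for c in box]


def refine(noisy_box, query, predict_x0, num_steps):
    """Reverse process: repeatedly estimate the clean box, then step to t - 1.

    predict_x0(x_t, t, query) is a hypothetical stand-in for the trained,
    language-conditioned denoising network.
    """
    x = list(noisy_box)
    for t in range(num_steps, 0, -1):
        a_t = cosine_alpha_bar(t, num_steps)
        a_prev = cosine_alpha_bar(t - 1, num_steps)
        x0_hat = predict_x0(x, t, query)
        # Noise implied by the current sample and the clean-box estimate.
        eps_hat = [(xi - math.sqrt(a_t) * x0i) / math.sqrt(1 - a_t)
                   for xi, x0i in zip(x, x0_hat)]
        # Deterministic (DDIM-style) move to the previous, less noisy step.
        x = [math.sqrt(a_prev) * x0i + math.sqrt(1 - a_prev) * ei
             for x0i, ei in zip(x0_hat, eps_hat)]
    return x


if __name__ == "__main__":
    rng = random.Random(0)
    gt_box = [0.5, 0.5, 0.2, 0.1]  # illustrative ground-truth box
    noisy = perturb_box(gt_box, 10, 10, rng)
    # An oracle predictor is used here only to exercise the loop.
    print(refine(noisy, "a red car", lambda x, t, q: gt_box, 10))
```

In a real system the oracle lambda would be replaced by a network that reads the image features and the query phrase; the loop structure is what lets the box estimate be refined over several steps instead of being produced once.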

arXiv information

Authors	Sijia Chen, Baochun Li
Published	2025-02-25 14:41:29+00:00

Source and services used

arxiv.jp, Google
