Image Difference Grounding with Natural Language

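Summary

Visual grounding (VG) typically focuses on locating regions of interest in an image from a natural-language query, and most existing VG methods are limited to interpreting a single image.
This limits their applicability in real-world scenarios such as automated surveillance, where detecting subtle but meaningful visual differences across multiple images is essential.
Moreover, prior work on image difference understanding (IDU) has either detected all changed regions without cross-modal text guidance or provided only coarse-grained descriptions of the differences.
To move toward finer-grained vision-language perception, we propose Image Difference Grounding (IDG), a task designed to precisely localize visual differences based on user instructions.
We introduce DiffGround, a large-scale, high-quality dataset for IDG that contains image pairs with diverse visual variations, together with instructions that query fine-grained differences.
We also present DiffTracker, a baseline model for IDG that integrates feature differential enhancement and common suppression to locate differences precisely.
Experiments on the DiffGround dataset highlight the importance of the IDG dataset in enabling finer-grained IDU.
To foster future research, both the DiffGround data and the DiffTracker model will be publicly released.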

Abstract (original)

Visual grounding (VG) typically focuses on locating regions of interest within an image using natural language, and most existing VG methods are limited to single-image interpretations. This limits their applicability in real-world scenarios like automatic surveillance, where detecting subtle but meaningful visual differences across multiple images is crucial. Besides, previous work on image difference understanding (IDU) has either focused on detecting all change regions without cross-modal text guidance, or on providing coarse-grained descriptions of differences. Therefore, to push towards finer-grained vision-language perception, we propose Image Difference Grounding (IDG), a task designed to precisely localize visual differences based on user instructions. We introduce DiffGround, a large-scale and high-quality dataset for IDG, containing image pairs with diverse visual variations along with instructions querying fine-grained differences. Besides, we present a baseline model for IDG, DiffTracker, which effectively integrates feature differential enhancement and common suppression to precisely locate differences. Experiments on the DiffGround dataset highlight the importance of our IDG dataset in enabling finer-grained IDU. To foster future research, both DiffGround data and DiffTracker model will be publicly released.
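
The abstract names "feature differential enhancement and common suppression" but does not describe how DiffTracker implements them. As a rough intuition only, the toy sketch below shows one way such an idea could look: per-location features from two images are subtracted, and the result is scaled down wherever the two features point in the same direction. The function names, the cosine-similarity weighting, and the list-based feature layout are all assumptions made for illustration and are not taken from the paper.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def difference_features(feats_a, feats_b):
    """Hypothetical helper, not the DiffTracker module.

    feats_a, feats_b: lists of per-location feature vectors from two images,
    aligned so that index i refers to the same spatial location in both.
    Returns one vector per location: the raw difference (b - a), scaled by a
    weight that is 0 where the features agree in direction and 1 where they
    point in opposite directions.
    """
    out = []
    for u, v in zip(feats_a, feats_b):
        sim = cosine_similarity(u, v)
        weight = (1.0 - sim) / 2.0  # assumption: maps similarity [-1, 1] to weight [1, 0]
        diff = [y - x for x, y in zip(u, v)]
        out.append([weight * d for d in diff])
    return out


if __name__ == "__main__":
    image_a = [[1.0, 0.0], [1.0, 0.0]]
    image_b = [[1.0, 0.0], [0.0, 1.0]]  # only the second location changed
    print(difference_features(image_a, image_b))
```

In this toy version an unchanged location yields an all-zero vector, while a changed location keeps a scaled copy of its difference. Because the weight depends only on direction, a change in feature magnitude alone would also be suppressed; a real model would more likely learn the weighting than fix it by hand.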

arXiv information

Authors	Wenxuan Wang, Zijia Zhao, Yisi Zhang, Yepeng Tang, Erdong Hu, Xinlong Wang, Jing Liu
Published	2025-04-02 17:56:42+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
