Zero-shot Referring Expression Comprehension via Structural Similarity Between Images and Captions

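Summary

Zero-shot referring expression comprehension aims to localize the bounding box in an image that corresponds to a given text prompt. This requires (i) fine-grained disentanglement of complex visual scenes and textual context, and (ii) the ability to understand the relationships among the disentangled entities.
Unfortunately, existing large vision-language alignment (VLA) models such as CLIP struggle with both aspects and therefore cannot be used directly for this task.
To mitigate this gap, the authors leverage large foundation models to decompose both images and texts into triplets of the form (subject, predicate, object).
Grounding is then achieved by using a VLA model to compute a structural similarity matrix between the visual and textual triplets, which is subsequently propagated to an instance-level similarity matrix.
In addition, to equip VLA models with the ability to understand relationships, they design a triplet-matching objective that fine-tunes the VLA models on a collection of curated datasets rich in entity relationships.
Experiments demonstrate a visual grounding performance improvement of up to 19.5% over the SOTA zero-shot model on RefCOCO/+/g.
On the more challenging Who’s Waldo dataset, the zero-shot approach achieves accuracy comparable to the fully supervised model.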

Summary (original)

Zero-shot referring expression comprehension aims at localizing bounding boxes in an image corresponding to the provided textual prompts, which requires: (i) a fine-grained disentanglement of complex visual scenes and textual context, and (ii) a capacity to understand relationships among disentangled entities. Unfortunately, existing large vision-language alignment (VLA) models, e.g., CLIP, struggle with both aspects and so cannot be directly used for this task. To mitigate this gap, we leverage large foundation models to disentangle both images and texts into triplets in the format of (subject, predicate, object). After that, grounding is accomplished by calculating the structural similarity matrix between visual and textual triplets with a VLA model and subsequently propagating it to an instance-level similarity matrix. Furthermore, to equip VLA models with the ability to understand relationships, we design a triplet-matching objective to fine-tune the VLA models on a collection of curated datasets containing abundant entity relationships. Experiments demonstrate a visual grounding performance increase of up to 19.5% over the SOTA zero-shot model on RefCOCO/+/g. On the more challenging Who’s Waldo dataset, our zero-shot approach achieves accuracy comparable to the fully supervised model.
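
The two-stage matching described in the abstract (triplet-level structural similarity, then propagation to instance level) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not taken from the paper: the toy 3-dimensional vectors stand in for VLA embeddings, the per-slot cosine average stands in for the structural similarity, and max-aggregation stands in for the propagation rule. All function names are hypothetical.

```python
import math

SLOTS = ("subject", "predicate", "object")


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def triplet_similarity(visual, textual):
    """Assumed scoring rule: average the cosine similarity over the three slots."""
    return sum(cosine(visual[k], textual[k]) for k in SLOTS) / len(SLOTS)


def structural_similarity(visual_triplets, textual_triplets):
    """Matrix with one row per visual triplet and one column per textual triplet."""
    return [[triplet_similarity(v, t) for t in textual_triplets]
            for v in visual_triplets]


def propagate_to_instances(sim, visual_ids, textual_ids, num_boxes, num_phrases):
    """Assumed propagation rule: each (box, phrase) pair keeps the best score of
    any triplet pair in which box and phrase fill the same slot.

    visual_ids[i]  = (subject_box, object_box) of visual triplet i
    textual_ids[j] = (subject_phrase, object_phrase) of textual triplet j
    """
    # -1.0 is the lower bound of the averaged cosine score.
    inst = [[-1.0] * num_phrases for _ in range(num_boxes)]
    for i, (v_subj, v_obj) in enumerate(visual_ids):
        for j, (t_subj, t_obj) in enumerate(textual_ids):
            inst[v_subj][t_subj] = max(inst[v_subj][t_subj], sim[i][j])
            inst[v_obj][t_obj] = max(inst[v_obj][t_obj], sim[i][j])
    return inst


def best_box(inst, phrase):
    """Index of the box with the highest instance-level score for a phrase."""
    return max(range(len(inst)), key=lambda box: inst[box][phrase])


# Toy embeddings (hypothetical stand-ins for VLA features).
PERSON, DOG, HOLD, NEAR = [1, 0, 0], [0, 1, 0], [0, 0, 1], [1, 1, 0]

# Box 0 is a person, box 1 is a dog.
visual_triplets = [
    {"subject": PERSON, "predicate": HOLD, "object": DOG},
    {"subject": DOG, "predicate": NEAR, "object": PERSON},
]
visual_ids = [(0, 1), (1, 0)]

# Phrase 0 is "man", phrase 1 is "dog" in the caption "man holds dog".
textual_triplets = [{"subject": PERSON, "predicate": HOLD, "object": DOG}]
textual_ids = [(0, 1)]

sim = structural_similarity(visual_triplets, textual_triplets)
inst = propagate_to_instances(sim, visual_ids, textual_ids, num_boxes=2, num_phrases=2)
print(best_box(inst, 0), best_box(inst, 1))
```

In this toy setup the first visual triplet matches the caption triplet in every slot, so its score is propagated to the (person box, "man") and (dog box, "dog") pairs. A real system would replace the toy vectors with features from a VLA model such as CLIP and would likely use a different aggregation rule.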

arXiv information

Authors	Zeyu Han, Fangrui Zhu, Qianru Lao, Huaizu Jiang
Published	2023-11-28 18:55:37+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
