Grounding Everything: Emerging Localization Properties in Vision-Language Transformers

Summary

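Vision-language foundation models have shown remarkable performance in a variety of zero-shot settings such as image retrieval, classification, and captioning.
So far, however, these models appear to lag behind in zero-shot localization of referring expressions and objects in images.
As a result, they have had to be fine-tuned for this task.
This paper shows that pretrained vision-language (VL) models allow zero-shot open-vocabulary object localization without any fine-tuning.
To exploit these capabilities, the authors propose the Grounding Everything Module (GEM), which generalizes the idea of value-value attention introduced by CLIPSurgery to a self-self attention path.
They show that self-self attention corresponds to clustering: it forces groups of tokens arising from the same object to be similar while preserving alignment with the language space.
To further guide group formation, they propose a set of regularizations that allows the model to generalize across datasets and backbones.
The proposed GEM framework is evaluated on various benchmark tasks and datasets for semantic segmentation.
The results show that GEM not only outperforms other training-free open-vocabulary localization methods, but also achieves state-of-the-art results on the recently proposed OpenImagesV7 large-scale segmentation benchmark.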

Summary (original)

Vision-language foundation models have shown remarkable performance in various zero-shot settings such as image retrieval, classification, or captioning. But so far, those models seem to fall behind when it comes to zero-shot localization of referential expressions and objects in images. As a result, they need to be fine-tuned for this task. In this paper, we show that pretrained vision-language (VL) models allow for zero-shot open-vocabulary object localization without any fine-tuning. To leverage those capabilities, we propose a Grounding Everything Module (GEM) that generalizes the idea of value-value attention introduced by CLIPSurgery to a self-self attention path. We show that the concept of self-self attention corresponds to clustering, thus enforcing groups of tokens arising from the same object to be similar while preserving the alignment with the language space. To further guide the group formation, we propose a set of regularizations that allows the model to finally generalize across datasets and backbones. We evaluate the proposed GEM framework on various benchmark tasks and datasets for semantic segmentation. It shows that GEM not only outperforms other training-free open-vocabulary localization methods, but also achieves state-of-the-art results on the recently proposed OpenImagesV7 large-scale segmentation benchmark.
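The claim that self-self attention behaves like clustering can be illustrated with a small sketch. The code below is not the authors' implementation: the function name `self_self_attention`, the single projection matrix `proj`, the L2 normalization, the `temperature` value, and the toy tokens are all assumptions made for illustration. Unlike standard attention, which compares queries against keys, each token here is compared against the same projection of every other token, so a token is replaced by a similarity-weighted average of the tokens that resemble it.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

def l2_normalize(v):
    norm = math.sqrt(dot(v, v)) or 1.0  # guard against a zero vector
    return [x / norm for x in v]

def softmax(row):
    m = max(row)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def self_self_attention(tokens, proj, temperature=0.1, iterations=1):
    """Hypothetical sketch: attention of one projection with itself.

    tokens: list of n token vectors of dimension d
    proj:   d x d_out projection matrix given as a list of rows
    Returns the updated tokens and the last attention matrix.
    """
    dim_out = len(proj[0])
    # Project every token with the same matrix: p = x @ proj
    p = [[sum(x[i] * proj[i][j] for i in range(len(x))) for j in range(dim_out)]
         for x in tokens]
    attn = []
    for _ in range(iterations):
        p = [l2_normalize(v) for v in p]
        # Similarity of each token to all tokens under the SAME projection
        attn = [softmax([dot(a, b) / temperature for b in p]) for a in p]
        # Each token becomes a weighted average of its look-alikes
        p = [[sum(w * v[j] for w, v in zip(row, p)) for j in range(dim_out)]
             for row in attn]
    return p, attn

# Toy input (assumed): two tokens near the x-axis, two near the y-axis.
tokens = [[1.0, 0.1], [0.9, 0.2], [0.1, 1.0], [0.2, 0.9]]
identity = [[1.0, 0.0], [0.0, 1.0]]
out, attn = self_self_attention(tokens, identity)

print("within-group similarity before:", cosine(tokens[0], tokens[1]))
print("within-group similarity after: ", cosine(out[0], out[1]))
print("across-group similarity after: ", cosine(out[0], out[2]))
```

In this toy setting, tokens in the same group are pulled toward a common direction while the two groups stay apart, which is the clustering behavior the abstract describes. Running more `iterations` repeats the averaging step; the regularizations mentioned in the abstract are not modeled here.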

arXiv information

Authors	Walid Bousselham, Felix Petersen, Vittorio Ferrari, Hilde Kuehne
Published	2023-12-05 16:39:31+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
