When and why vision-language models behave like bags-of-words, and what to do about it?

Summary

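Despite the success of large vision and language models (VLMs) in many downstream applications, it is unclear how well they encode compositional information.
Here, the authors create the Attribution, Relation, and Order (ARO) benchmark to systematically evaluate the ability of VLMs to understand different types of relationships, attributes, and order.
ARO consists of three parts: Visual Genome Attribution, which tests the understanding of objects' properties; Visual Genome Relation, which tests relational understanding; and COCO & Flickr30k-Order, which tests sensitivity to word order.
ARO is orders of magnitude larger than previous compositionality benchmarks, with more than 50,000 test cases.
The authors show that state-of-the-art VLMs have poor relational understanding, can fail when linking objects to their attributes, and show a severe lack of order sensitivity.
VLMs are predominantly trained and evaluated on large datasets whose images and captions have rich compositional structure.
Yet training on these datasets has not been enough to address the lack of compositional understanding, and evaluating on them has failed to surface the deficiency.
To understand why these limitations emerge and why standard tests do not reveal them, the authors take a closer look at the evaluation and training procedures.
They demonstrate that it is possible to perform well on retrieval over existing datasets without using composition or order information.
Given that contrastive pretraining optimizes for retrieval on datasets with similar shortcuts, they hypothesize that this explains why the models do not need to learn to represent compositional information.
This finding suggests a natural solution: composition-aware hard negative mining.
They show that a simple-to-implement modification of contrastive learning significantly improves performance on tasks that require understanding of order and compositionality.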
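
The summary names composition-aware hard negative mining but does not spell out the procedure. The sketch below is only one way such a negative could be built and scored, under stated assumptions: the helper names `swap_two_words` and `image_to_text_loss` are hypothetical, the word-swap perturbation is an assumption rather than the authors' documented method, and the similarity values are made-up numbers standing in for model outputs.

```python
import math
import random


def swap_two_words(caption, rng):
    """Exchange two differing words to get a caption with the same words
    in a different arrangement (assumed perturbation, for illustration)."""
    words = caption.split()
    if len(set(words)) < 2:
        return caption  # nothing to swap that would change the caption
    while True:
        i, j = rng.sample(range(len(words)), 2)
        if words[i] != words[j]:
            break
    words[i], words[j] = words[j], words[i]
    return " ".join(words)


def image_to_text_loss(sim_row, positive_index):
    """Softmax cross-entropy for one image over its candidate captions.

    sim_row holds image-caption similarity scores (logits); the caption at
    positive_index is the true match, all others act as negatives.
    """
    m = max(sim_row)  # subtract the max for numerical stability
    log_sum = m + math.log(sum(math.exp(s - m) for s in sim_row))
    return log_sum - sim_row[positive_index]


rng = random.Random(0)
caption = "the horse is eating the grass"
hard_negative = swap_two_words(caption, rng)

# Made-up similarities: [true caption, unrelated caption]
loss_without = image_to_text_loss([5.0, 1.0], 0)
# Same row plus the perturbed caption, which a bag-of-words-like model
# would score almost as high as the true caption.
loss_with = image_to_text_loss([5.0, 1.0, 4.8], 0)
print(hard_negative, loss_without, loss_with)
```

Adding the perturbed caption as an extra candidate raises the loss whenever the model scores it close to the true caption, so gradient descent has a reason to separate the two. An ordinary in-batch negative that shares few words with the true caption provides little such pressure.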

Summary (original)

Despite the success of large vision and language models (VLMs) in many downstream applications, it is unclear how well they encode compositional information. Here, we create the Attribution, Relation, and Order (ARO) benchmark to systematically evaluate the ability of VLMs to understand different types of relationships, attributes, and order. ARO consists of Visual Genome Attribution, to test the understanding of objects’ properties; Visual Genome Relation, to test for relational understanding; and COCO & Flickr30k-Order, to test for order sensitivity. ARO is orders of magnitude larger than previous benchmarks of compositionality, with more than 50,000 test cases. We show where state-of-the-art VLMs have poor relational understanding, can blunder when linking objects to their attributes, and demonstrate a severe lack of order sensitivity. VLMs are predominantly trained and evaluated on large datasets with rich compositional structure in the images and captions. Yet, training on these datasets has not been enough to address the lack of compositional understanding, and evaluating on these datasets has failed to surface this deficiency. To understand why these limitations emerge and are not represented in the standard tests, we zoom into the evaluation and training procedures. We demonstrate that it is possible to perform well on retrieval over existing datasets without using the composition and order information. Given that contrastive pretraining optimizes for retrieval on datasets with similar shortcuts, we hypothesize that this can explain why the models do not need to learn to represent compositional information. This finding suggests a natural solution: composition-aware hard negative mining. We show that a simple-to-implement modification of contrastive learning significantly improves the performance on tasks requiring understanding of order and compositionality.
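
To make the "bags-of-words" framing concrete, here is a minimal sketch of a purely order-insensitive text scorer. It is an illustration only, not the benchmark's evaluation code: `bow_cosine` is a hypothetical helper, and the two captions are invented examples that contain the same words in different roles.

```python
import math
from collections import Counter


def bow_cosine(text_a, text_b):
    """Cosine similarity between word-count vectors; word order is ignored."""
    counts_a = Counter(text_a.lower().split())
    counts_b = Counter(text_b.lower().split())
    dot = sum(counts_a[w] * counts_b[w] for w in counts_a)
    norm_a = math.sqrt(sum(v * v for v in counts_a.values()))
    norm_b = math.sqrt(sum(v * v for v in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


reference = "the horse is eating the grass"
correct = "the horse is eating the grass"
swapped = "the grass is eating the horse"

# Both candidates have identical word counts, so a bag-of-words scorer
# cannot tell which one describes the scene.
print(bow_cosine(reference, correct), bow_cosine(reference, swapped))
```

A model that behaves like this scorer can still rank captions well in retrieval when the distractor captions use different words, which is the kind of shortcut the abstract describes.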

arXiv information

Authors	Mert Yuksekgonul, Federico Bianchi, Pratyusha Kalluri, Dan Jurafsky, James Zou
Published	2023-03-23 23:21:38+00:00
arXiv site	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google
