CREPE: Can Vision-Language Foundation Models Reason Compositionally?

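Summary

Compositionality is a fundamental characteristic shared by human vision and natural language.
Yet despite the performance gains from large-scale vision and language pretraining, we find that models struggle with compositionality across 7 architectures trained with 4 algorithms on massive datasets.
To reach this conclusion, we introduce CREPE, a new compositionality evaluation benchmark that measures two important aspects of compositionality identified in the cognitive science literature: systematicity and productivity.
To measure systematicity, CREPE consists of a test dataset containing over 370K image-text pairs and three different seen-unseen splits.
The three splits are designed to test models trained on three popular training datasets: CC-12M, YFCC-15M, and LAION-400M.
We also generate 325K, 316K, and 309K hard negative captions for a subset of the pairs.
To test productivity, CREPE contains 17K image-text pairs at nine different complexities, plus 183K hard negative captions with atomic, swapping, and negation foils.
The datasets are generated by repurposing Visual Genome scene graphs and region descriptions and applying handcrafted templates and GPT-3.
For systematicity, model performance decreases consistently when novel compositions dominate the retrieval set, with Recall@1 dropping by up to 12%.
For productivity, models' retrieval success decays as complexity increases, frequently nearing random chance at high complexity.
These results hold regardless of model and training dataset size.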

Summary (original)

A fundamental characteristic common to both human vision and natural language is their compositional nature. Yet, despite the performance gains contributed by large vision and language pretraining, we find that: across 7 architectures trained with 4 algorithms on massive datasets, they struggle at compositionality. To arrive at this conclusion, we introduce a new compositionality evaluation benchmark, CREPE, which measures two important aspects of compositionality identified by cognitive science literature: systematicity and productivity. To measure systematicity, CREPE consists of a test dataset containing over 370K image-text pairs and three different seen-unseen splits. The three splits are designed to test models trained on three popular training datasets: CC-12M, YFCC-15M, and LAION-400M. We also generate 325K, 316K, and 309K hard negative captions for a subset of the pairs. To test productivity, CREPE contains 17K image-text pairs with nine different complexities plus 183K hard negative captions with atomic, swapping and negation foils. The datasets are generated by repurposing the Visual Genome scene graphs and region descriptions and applying handcrafted templates and GPT-3. For systematicity, we find that model performance decreases consistently when novel compositions dominate the retrieval set, with Recall@1 dropping by up to 12%. For productivity, models’ retrieval success decays as complexity increases, frequently nearing random chance at high complexity. These results hold regardless of model and training dataset size.
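
To make the Recall@1 and "random chance" figures in the abstract concrete, here is a minimal sketch of how such a retrieval score could be computed. It assumes an image-to-text setup in which each image is scored against one ground-truth caption plus a fixed number of hard negatives; that layout, the function names `recall_at_1` and `chance_level`, and the toy similarity scores are all illustrative assumptions, not the benchmark's actual evaluation code.

```python
from typing import Sequence


def recall_at_1(score_rows: Sequence[Sequence[float]], true_index: int = 0) -> float:
    """Fraction of queries whose ground-truth caption receives the top score.

    Each row holds one image's similarity scores over its own retrieval set.
    Assumption: the ground-truth caption sits at `true_index` and every other
    position is a hard negative.
    """
    if not score_rows:
        raise ValueError("need at least one query")
    hits = 0
    for row in score_rows:
        # Index of the highest-scoring caption; ties resolve to the lowest index.
        best = max(range(len(row)), key=lambda i: row[i])
        if best == true_index:
            hits += 1
    return hits / len(score_rows)


def chance_level(num_candidates: int) -> float:
    """Expected Recall@1 of a model that picks a caption uniformly at random."""
    return 1.0 / num_candidates


# Made-up similarity scores: 4 images, each with 1 ground truth + 3 hard negatives.
scores = [
    [0.9, 0.2, 0.1, 0.3],  # ground truth ranked first
    [0.4, 0.7, 0.1, 0.2],  # a hard negative outranks the ground truth
    [0.8, 0.5, 0.6, 0.1],  # ground truth ranked first
    [0.3, 0.3, 0.5, 0.2],  # a hard negative outranks the ground truth
]

print(recall_at_1(scores))  # 0.5
print(chance_level(4))      # 0.25
```

Under this setup, a model "nearing random chance" would score close to `chance_level(k)` for a retrieval set of `k` captions. Note that the tie-breaking rule above favors the ground truth when it is stored at index 0; a stricter evaluation might count ties as misses.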

arXiv information

Authors	Zixian Ma, Jerry Hong, Mustafa Omer Gul, Mona Gandhi, Irena Gao, Ranjay Krishna
Published	2023-05-16 16:27:08+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
