Does CLIP Bind Concepts? Probing Compositionality in Large Image Models

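Summary

Large-scale neural network models that combine text and images have made remarkable progress in recent years.
However, it remains an open question how far such models encode compositional representations of the concepts they operate over, for example correctly identifying "red cube" by reasoning over the constituents "red" and "cube".
This work focuses on the ability of a large pretrained vision and language model (CLIP) to encode compositional concepts and to bind variables in a structure-sensitive way (for example, distinguishing "cube behind sphere" from "sphere behind cube").
To inspect CLIP's performance, the authors compare several architectures from research on compositional distributional semantics models (CDSMs), a line of work that attempts to implement traditional compositional linguistic structures within embedding spaces.
The models are benchmarked on three synthetic datasets (single-object, two-object, and relational) designed to test concept binding.
CLIP can compose concepts in the single-object setting, but its performance drops dramatically in situations that require concept binding.
The CDSMs also perform poorly: even their best results are only at chance level.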

Summary (original)

Large-scale neural network models combining text and images have made incredible progress in recent years. However, it remains an open question to what extent such models encode compositional representations of the concepts over which they operate, such as correctly identifying ‘red cube’ by reasoning over the constituents ‘red’ and ‘cube’. In this work, we focus on the ability of a large pretrained vision and language model (CLIP) to encode compositional concepts and to bind variables in a structure-sensitive way (e.g., differentiating ‘cube behind sphere’ from ‘sphere behind cube’). To inspect the performance of CLIP, we compare several architectures from research on compositional distributional semantics models (CDSMs), a line of research that attempts to implement traditional compositional linguistic structures within embedding spaces. We benchmark them on three synthetic datasets – single-object, two-object, and relational – designed to test concept binding. We find that CLIP can compose concepts in a single-object setting, but in situations where concept binding is needed, performance drops dramatically. At the same time, CDSMs also perform poorly, with best performance at chance level.
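The binding problem in the abstract can be illustrated with a minimal sketch. It assumes toy one-hot word vectors (`TOY_EMB`) and two hypothetical helper functions, `compose_additive` and `compose_concat`; none of these come from the paper, and neither function is claimed to be one of the CDSM architectures it evaluates. The point is only that an order-insensitive composition such as vector addition cannot tell "cube behind sphere" from "sphere behind cube", whereas an order-preserving one can.

```python
# Toy one-hot word vectors; hypothetical stand-ins for learned embeddings.
TOY_EMB = {
    "cube":   [1, 0, 0],
    "sphere": [0, 1, 0],
    "behind": [0, 0, 1],
}

def compose_additive(phrase, emb=TOY_EMB):
    """Sum the word vectors of a phrase; word order is discarded."""
    dim = len(next(iter(emb.values())))
    total = [0] * dim
    for word in phrase.split():
        total = [t + x for t, x in zip(total, emb[word])]
    return total

def compose_concat(phrase, emb=TOY_EMB):
    """Concatenate the word vectors in order; word order is preserved."""
    out = []
    for word in phrase.split():
        out.extend(emb[word])
    return out

a = "cube behind sphere"
b = "sphere behind cube"
print(compose_additive(a) == compose_additive(b))  # True: the two phrases collapse
print(compose_concat(a) == compose_concat(b))      # False: the two phrases stay distinct
```

Concatenation is used here only because it is the simplest order-preserving operation; a real model would need a composition that both preserves structure and generalizes to unseen combinations.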

arXiv information

Authors	Martha Lewis,Nihal V. Nayak,Peilin Yu,Qinan Yu,Jack Merullo,Stephen H. Bach,Ellie Pavlick
Published	2024-08-30 04:51:28+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
