The Hard Positive Truth about Vision-Language Compositionality

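Summary

Several benchmarks have concluded that our best vision-language models (such as CLIP) lack compositionality.
Given an image, these benchmarks test whether a model can pick out the matching caption from a set of compositional distractors.
In response, a wave of recent proposals reports improvements from finetuning CLIP with the distractors as hard negatives.
Our investigation shows that these improvements are in fact substantially overstated, because existing benchmarks do not check whether finetuned vision-language models remain invariant to hard positives.
By curating an evaluation dataset of 112,382 hard negatives and hard positives, we find that including hard positives lowers CLIP's performance by 12.9%, while humans reach 99% with ease.
Finetuning CLIP with hard negatives leads to an even larger drop, up to 38.7%.
Based on this finding, we build a training set of 1,775,259 image-text pairs that contains both hard negative and hard positive captions.
Training with both improves results on existing benchmarks and on hard positives at the same time, which indicates a more robust gain in compositionality.
Our work points to the need for future research that rigorously tests and improves CLIP's understanding of the semantic relationships between related "positive" concepts.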

Abstract (original)

Several benchmarks have concluded that our best vision-language models (e.g., CLIP) are lacking in compositionality. Given an image, these benchmarks probe a model’s ability to identify its associated caption amongst a set of compositional distractors. In response, a surge of recent proposals show improvements by finetuning CLIP with distractors as hard negatives. Our investigations reveal that these improvements have, in fact, been significantly overstated — because existing benchmarks do not probe whether finetuned vision-language models remain invariant to hard positives. By curating an evaluation dataset with 112,382 hard negatives and hard positives, we uncover that including hard positives decreases CLIP’s performance by 12.9%, while humans perform effortlessly at 99%. CLIP finetuned with hard negatives results in an even larger decrease, up to 38.7%. With this finding, we then produce a 1,775,259 image-text training set with both hard negative and hard positive captions. By training with both, we see improvements on existing benchmarks while simultaneously improving performance on hard positives, indicating a more robust improvement in compositionality. Our work suggests the need for future research to rigorously test and improve CLIP’s understanding of semantic relationships between related ‘positive’ concepts.
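The gap between hard-negative-only evaluation and evaluation that also includes hard positives can be illustrated with a minimal sketch. It assumes one plausible scoring rule, not necessarily the exact metric used in the paper: an example counts as correct only if the original caption and the hard positive both score higher than the hard negative. The `Example` class, the function names, and the similarity values are all hypothetical and chosen for illustration; in practice the scores would come from a model such as CLIP.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Example:
    # Hypothetical image-text similarity scores for a single image.
    original: float       # score of the original (correct) caption
    hard_negative: float  # score of a caption altered so that it no longer matches the image
    hard_positive: float  # score of a caption altered so that it still matches the image


def negative_only_accuracy(examples: List[Example]) -> float:
    """Fraction of examples where the original caption beats the hard negative."""
    correct = sum(e.original > e.hard_negative for e in examples)
    return correct / len(examples)


def joint_accuracy(examples: List[Example]) -> float:
    """Fraction where the original AND the hard positive both beat the hard negative.

    This stricter rule is an assumption made for this sketch.
    """
    correct = sum(
        e.original > e.hard_negative and e.hard_positive > e.hard_negative
        for e in examples
    )
    return correct / len(examples)


# Made-up scores, purely for illustration.
examples = [
    Example(original=0.31, hard_negative=0.25, hard_positive=0.29),
    Example(original=0.30, hard_negative=0.27, hard_positive=0.22),
    Example(original=0.24, hard_negative=0.28, hard_positive=0.26),
    Example(original=0.33, hard_negative=0.20, hard_positive=0.30),
]

print(negative_only_accuracy(examples))  # 0.75
print(joint_accuracy(examples))          # 0.5
```

In this toy data, the second example passes the hard-negative test but fails once the hard positive is included, which is the kind of failure a hard-negative-only benchmark cannot detect.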

arXiv information

Authors	Amita Kamath, Cheng-Yu Hsieh, Kai-Wei Chang, Ranjay Krishna
Published	2024-09-26 15:36:10+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
