BiVLC: Extending Vision-Language Compositionality Evaluation with Text-to-Image Retrieval

Summary

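Existing Vision-Language Compositionality (VLC) benchmarks such as SugarCrepe are formulated as image-to-text retrieval problems: given an image, the model must choose between the correct textual description and a synthetic hard negative text.
In this work, we present the Bidirectional Vision-Language Compositionality (BiVLC) dataset.
The novelty of BiVLC is the addition of a synthetic hard negative image generated from the synthetic text. This yields two image-to-text retrieval examples (one for each image) and, more importantly, two text-to-image retrieval examples (one for each text).
Human annotators filter out ill-formed examples to ensure the validity of the benchmark.
Experiments on BiVLC uncover a weakness of current multimodal models: they perform poorly in the text-to-image direction.
In fact, when both retrieval directions are considered, the conclusions of previous work change significantly.
In addition to the benchmark, we show that a contrastive model trained on synthetic images and texts improves the state of the art on SugarCrepe and on BiVLC in both retrieval directions.
The gap to human performance on BiVLC confirms that vision-language compositionality remains a challenging problem.
BiVLC and the code are available at https://imirandam.github.io/BiVLC_project_page.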
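
The bidirectional setup described above (two images and two texts per instance, giving two image-to-text and two text-to-image examples) can be illustrated with a small sketch. This is not the authors' evaluation code. It assumes a hypothetical 2x2 similarity matrix `sim` in which `sim[i][j]` is a model's score for image `i` and text `j`, index 0 is the original image/caption pair, index 1 is the synthetic hard negative pair, and a retrieval example counts as correct when the matching pair scores strictly higher than the mismatched one. The function names `score_instance` and `directional_accuracy` are made up for illustration.

```python
from typing import Dict, List

Matrix = List[List[float]]


def score_instance(sim: Matrix) -> Dict[str, List[bool]]:
    """Score one instance from a 2x2 image-by-text similarity matrix.

    Assumed layout (hypothetical): sim[i][j] = similarity(image i, text j),
    with index 0 the original pair and index 1 the synthetic hard negative pair.
    """
    # Image-to-text: for each image, the matching text must beat the other text.
    i2t = [sim[i][i] > sim[i][1 - i] for i in range(2)]
    # Text-to-image: for each text, the matching image must beat the other image.
    t2i = [sim[j][j] > sim[1 - j][j] for j in range(2)]
    return {"i2t": i2t, "t2i": t2i}


def directional_accuracy(sims: List[Matrix]) -> Dict[str, float]:
    """Average accuracy per retrieval direction over a list of instances."""
    results = [score_instance(s) for s in sims]
    n_examples = 2 * len(results)  # two retrieval examples per direction per instance
    return {
        "i2t": sum(sum(r["i2t"]) for r in results) / n_examples,
        "t2i": sum(sum(r["t2i"]) for r in results) / n_examples,
    }


if __name__ == "__main__":
    # Made-up similarity scores, used only to exercise the functions.
    example_sims = [
        [[0.9, 0.4], [0.8, 0.7]],
        [[0.5, 0.6], [0.2, 0.9]],
    ]
    print(directional_accuracy(example_sims))
```

Reporting the two directions separately, as in this sketch, is what allows a weakness in one direction to show up even when the other direction looks strong.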

Summary (original)

Existing Vision-Language Compositionality (VLC) benchmarks like SugarCrepe are formulated as image-to-text retrieval problems, where, given an image, the models need to select between the correct textual description and a synthetic hard negative text. In this work we present the Bidirectional Vision-Language Compositionality (BiVLC) dataset. The novelty of BiVLC is to add a synthetic hard negative image generated from the synthetic text, resulting in two image-to-text retrieval examples (one for each image) and, more importantly, two text-to-image retrieval examples (one for each text). Human annotators filter out ill-formed examples ensuring the validity of the benchmark. The experiments on BiVLC uncover a weakness of current multimodal models, as they perform poorly in the text-to-image direction. In fact, when considering both retrieval directions, the conclusions obtained in previous works change significantly. In addition to the benchmark, we show that a contrastive model trained using synthetic images and texts improves the state of the art in SugarCrepe and in BiVLC for both retrieval directions. The gap to human performance in BiVLC confirms that Vision-Language Compositionality is still a challenging problem. BiVLC and code are available at https://imirandam.github.io/BiVLC_project_page.

arXiv information

Authors	Imanol Miranda, Ander Salaberria, Eneko Agirre, Gorka Azkune
Published	2024-06-14 11:58:49+00:00
arXiv site	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google
