Natural Language Inference Improves Compositionality in Vision-Language Models

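Summary

Compositional reasoning remains challenging for Vision-Language Models (VLMs), which often struggle to relate objects, attributes, and spatial relationships.
Recent methods aim to address these limitations by relying on the semantics of the textual description, using Large Language Models (LLMs) to break the description down into subsets of questions and answers.
However, these methods operate primarily at the surface level: they fail to incorporate deeper lexical understanding, and they introduce incorrect assumptions generated by the LLM.
In response to these issues, we present Caption Expansion with Contradictions and Entailments (CECE), a principled approach that leverages Natural Language Inference (NLI) to generate entailments and contradictions from a given premise.
CECE produces lexically diverse sentences while maintaining their core meaning.
Through extensive experiments, we show that CECE enhances interpretability and reduces overreliance on biased or superficial features.
By balancing CECE with the original premise, we achieve significant improvements over previous methods without requiring additional fine-tuning, and produce state-of-the-art results on benchmarks that score agreement with human judgments for image-text alignment.
Compared with the best prior work (which was fine-tuned with targeted data), performance increases by +19.2% on Winoground (group score) and by +12.9% on EqBen (group score).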
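
The idea of balancing the expanded captions against the original premise can be illustrated with a minimal sketch. The abstract does not give the scoring formula, so everything here is an assumption made for illustration: `score_fn` is a hypothetical image-text alignment scorer returning a value in [0, 1], contradictions are turned into supporting evidence as `1 - score`, and `alpha` is a hypothetical weight on the original caption.

```python
from statistics import mean
from typing import Any, Callable, Sequence


def balanced_alignment_score(
    image: Any,
    caption: str,
    entailments: Sequence[str],
    contradictions: Sequence[str],
    score_fn: Callable[[Any, str], float],
    alpha: float = 0.5,
) -> float:
    """Combine the original caption score with scores of expanded captions.

    Assumptions (not taken from the paper): score_fn returns a value in
    [0, 1]; an entailment should score high for a matching image; a
    contradiction should score low, so it contributes 1 - score.
    """
    original = score_fn(image, caption)
    if not entailments and not contradictions:
        # Nothing to balance against: fall back to the original caption.
        return original

    evidence = [score_fn(image, e) for e in entailments]
    evidence += [1.0 - score_fn(image, c) for c in contradictions]
    expanded = mean(evidence)

    # alpha weights the original premise against the expanded captions.
    return alpha * original + (1.0 - alpha) * expanded
```

In practice the entailments and contradictions would be generated by an LLM prompted with the caption as the premise, and `score_fn` would be a VLM-based scorer. Both are left abstract here because the abstract does not specify them.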

Summary (original)

Compositional reasoning in Vision-Language Models (VLMs) remains challenging as these models often struggle to relate objects, attributes, and spatial relationships. Recent methods aim to address these limitations by relying on the semantics of the textual description, using Large Language Models (LLMs) to break them down into subsets of questions and answers. However, these methods primarily operate on the surface level, failing to incorporate deeper lexical understanding while introducing incorrect assumptions generated by the LLM. In response to these issues, we present Caption Expansion with Contradictions and Entailments (CECE), a principled approach that leverages Natural Language Inference (NLI) to generate entailments and contradictions from a given premise. CECE produces lexically diverse sentences while maintaining their core meaning. Through extensive experiments, we show that CECE enhances interpretability and reduces overreliance on biased or superficial features. By balancing CECE along the original premise, we achieve significant improvements over previous methods without requiring additional fine-tuning, producing state-of-the-art results on benchmarks that score agreement with human judgments for image-text alignment, and achieving an increase in performance on Winoground of +19.2% (group score) and +12.9% on EqBen (group score) over the best prior work (finetuned with targeted data).
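
For readers unfamiliar with the group score reported for Winoground, the sketch below shows how it is commonly computed. Each example has two captions (c0, c1) and two images (i0, i1), where c0 matches i0 and c1 matches i1. An example counts toward the group score only if the model ranks the pairs correctly in both the text and the image direction. The dictionary keys and the function name are hypothetical choices for this illustration.

```python
from typing import Dict, List, Tuple


def winoground_scores(examples: List[Dict[str, float]]) -> Tuple[float, float, float]:
    """Return (text, image, group) accuracy over a list of examples.

    Each example maps the hypothetical keys "c0_i0", "c0_i1", "c1_i0",
    "c1_i1" to the model's alignment score for that caption-image pair.
    """
    text_hits = image_hits = group_hits = 0
    for ex in examples:
        # Text score: for each image, the matching caption must win.
        text_ok = ex["c0_i0"] > ex["c1_i0"] and ex["c1_i1"] > ex["c0_i1"]
        # Image score: for each caption, the matching image must win.
        image_ok = ex["c0_i0"] > ex["c0_i1"] and ex["c1_i1"] > ex["c1_i0"]
        text_hits += text_ok
        image_hits += image_ok
        # Group score: both directions must be correct.
        group_hits += text_ok and image_ok
    n = len(examples)
    return text_hits / n, image_hits / n, group_hits / n
```

Because the group score requires all four comparisons to hold at once, it is the strictest of the three metrics.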

arXiv information

Authors	Paola Cascante-Bonilla, Yu Hou, Yang Trista Cao, Hal Daumé III, Rachel Rudinger
Published	2024-10-29 17:54:17+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
