Cross-modal Attention Congruence Regularization for Vision-Language Relation Alignment

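Summary

Despite recent progress in scaling up multimodal vision-language models, these models are still known to struggle on compositional generalization benchmarks such as Winoground.
We find that a key component missing from current vision-language models is relation-level alignment: the ability to match directional semantic relations in text (e.g., 'mug in grass') with spatial relationships in the image (e.g., the position of the mug relative to the grass).
To tackle this problem, we show that relation alignment can be enforced by encouraging the directed language attention from 'mug' to 'grass' (which captures the semantic relation 'in') to match the directed visual attention from the mug to the grass.
Tokens and their corresponding objects are softly identified using cross-modal attention.
We prove that this notion of soft relation alignment is equivalent to enforcing congruence between the vision and language attention matrices under a 'change of basis' provided by the cross-modal attention matrix.
Intuitively, our approach projects visual attention into the language attention space to compute its divergence from the actual language attention, and vice versa.
We apply our Cross-modal Attention Congruence Regularization (CACR) loss to UNITER and improve on the state-of-the-art approach to Winoground.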

Abstract (original)

Despite recent progress towards scaling up multimodal vision-language models, these models are still known to struggle on compositional generalization benchmarks such as Winoground. We find that a critical component lacking from current vision-language models is relation-level alignment: the ability to match directional semantic relations in text (e.g., ‘mug in grass’) with spatial relationships in the image (e.g., the position of the mug relative to the grass). To tackle this problem, we show that relation alignment can be enforced by encouraging the directed language attention from ‘mug’ to ‘grass’ (capturing the semantic relation ‘in’) to match the directed visual attention from the mug to the grass. Tokens and their corresponding objects are softly identified using the cross-modal attention. We prove that this notion of soft relation alignment is equivalent to enforcing congruence between vision and language attention matrices under a ‘change of basis’ provided by the cross-modal attention matrix. Intuitively, our approach projects visual attention into the language attention space to calculate its divergence from the actual language attention, and vice versa. We apply our Cross-modal Attention Congruence Regularization (CACR) loss to UNITER and improve on the state-of-the-art approach to Winoground.
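
The 'change of basis' idea described in the abstract can be illustrated numerically. The following is a minimal NumPy sketch, not the authors' implementation: the function names, the assumption that every attention matrix is row-normalized, and the choice of a row-wise KL divergence are all made up for this example. The abstract states only that visual attention is projected into the language attention space and compared with the actual language attention, and vice versa.

```python
import numpy as np


def row_softmax(scores: np.ndarray) -> np.ndarray:
    """Turn each row of raw scores into a probability distribution."""
    shifted = scores - scores.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)


def row_kl(p: np.ndarray, q: np.ndarray, eps: float = 1e-9) -> float:
    """Mean over rows of KL(p_row || q_row); eps guards against log(0)."""
    per_row = np.sum(p * (np.log(p + eps) - np.log(q + eps)), axis=-1)
    return float(np.mean(per_row))


def congruence_loss(
    lang_attn: np.ndarray,    # (n, n) language self-attention, rows sum to 1
    vis_attn: np.ndarray,     # (m, m) visual self-attention, rows sum to 1
    lang_to_vis: np.ndarray,  # (n, m) cross-modal attention, tokens -> regions
    vis_to_lang: np.ndarray,  # (m, n) cross-modal attention, regions -> tokens
) -> float:
    """Hypothetical congruence penalty (assumed form, see lead-in)."""
    # "Change of basis": express visual attention in token coordinates.
    vis_in_lang = lang_to_vis @ vis_attn @ vis_to_lang   # (n, n)
    # And the reverse: express language attention in region coordinates.
    lang_in_vis = vis_to_lang @ lang_attn @ lang_to_vis  # (m, m)
    # Products of row-stochastic matrices stay row-stochastic, so each
    # projected matrix can be compared to its target row by row.
    return row_kl(lang_attn, vis_in_lang) + row_kl(vis_attn, lang_in_vis)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n_tokens, n_regions = 3, 5  # illustrative sizes only
    loss = congruence_loss(
        row_softmax(rng.normal(size=(n_tokens, n_tokens))),
        row_softmax(rng.normal(size=(n_regions, n_regions))),
        row_softmax(rng.normal(size=(n_tokens, n_regions))),
        row_softmax(rng.normal(size=(n_regions, n_tokens))),
    )
    print(f"congruence loss on random attention: {loss:.4f}")
```

In this sketch the penalty vanishes when the cross-modal attention is a hard one-to-one matching between tokens and regions and the two self-attention matrices agree under that matching. Soft cross-modal attention blends rows and columns instead of permuting them, which is the sense in which tokens and objects are "softly identified".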

arXiv information

Authors	Rohan Pandey, Rulin Shao, Paul Pu Liang, Ruslan Salakhutdinov, Louis-Philippe Morency
Published	2022-12-20 18:53:14+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
