Probing the Role of Positional Information in Vision-Language Models

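Summary

In most vision-language (VL) models, an understanding of image structure is enabled by injecting positional information (PI) about the objects in the image.
In a case study of LXMERT, a state-of-the-art VL model, we probe how PI is used in the representations and study its effect on visual question answering.
We show that the model cannot leverage PI for the image-text matching task on a challenge set in which only the position differs.
Yet our probing experiments confirm that PI is indeed present in the representations.
We introduce two strategies to address this: (i) positional information pre-training and (ii) contrastive learning on PI using cross-modality matching.
With these, the model can correctly classify whether an image matches a detailed PI statement.
In addition to the 2D information from bounding boxes, we introduce object depth as a new feature for better localizing objects in space.
Although we were able to improve the model properties as defined by our probes, this has only a negligible effect on downstream performance.
Our results thus highlight an important issue in multimodal modeling: the mere presence of information detectable by a probing classifier does not guarantee that the information is available in a cross-modal setup.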
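
For readers unfamiliar with the term, a probing classifier is a small model trained on frozen representations to test whether some property can be read out of them. The following is a minimal sketch on synthetic data, not the authors' setup: the feature matrix `X`, the coordinate `x_center`, and the helper names `train_linear_probe` and `probe_accuracy` are all assumptions made for illustration, and no LXMERT representations are involved.

```python
import numpy as np


def train_linear_probe(X, y, lr=0.1, epochs=500):
    """Fit a logistic-regression probe with plain gradient descent."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        probs = 1.0 / (1.0 + np.exp(-(X @ w + b)))
        grad = probs - y                      # gradient of the log loss w.r.t. the logits
        w -= lr * (X.T @ grad) / len(y)
        b -= lr * grad.mean()
    return w, b


def probe_accuracy(X, y, w, b):
    """Fraction of examples on which the probe predicts the correct label."""
    preds = (X @ w + b) > 0
    return float((preds == y.astype(bool)).mean())


# Synthetic stand-in for frozen model representations (assumption):
# each vector linearly encodes a normalized x-coordinate plus noise.
rng = np.random.default_rng(0)
n, d = 1000, 16
x_center = rng.uniform(0.0, 1.0, size=n)      # hypothetical box-center x-coordinate
y = (x_center > 0.5).astype(float)            # label: object lies in the right half
mixing = rng.normal(size=d)                   # random direction carrying the position
X = np.outer(x_center - 0.5, mixing) + 0.05 * rng.normal(size=(n, d))

# Train on the first 800 examples, evaluate on the held-out 200.
w, b = train_linear_probe(X[:800], y[:800])
acc = probe_accuracy(X[800:], y[800:], w, b)
print(f"held-out probe accuracy: {acc:.3f}")
```

High held-out accuracy here only shows that the position can be decoded from the vectors. As the summary above stresses, it says nothing about whether a downstream model actually uses that information.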

Summary (original)

In most Vision-Language models (VL), the understanding of the image structure is enabled by injecting the position information (PI) about objects in the image. In our case study of LXMERT, a state-of-the-art VL model, we probe the use of the PI in the representation and study its effect on Visual Question Answering. We show that the model is not capable of leveraging the PI for the image-text matching task on a challenge set where only position differs. Yet, our experiments with probing confirm that the PI is indeed present in the representation. We introduce two strategies to tackle this: (i) Positional Information Pre-training and (ii) Contrastive Learning on PI using Cross-Modality Matching. Doing so, the model can correctly classify if images with detailed PI statements match. Additionally to the 2D information from bounding boxes, we introduce the object’s depth as new feature for a better object localization in the space. Even though we were able to improve the model properties as defined by our probes, it only has a negligible effect on the downstream performance. Our results thus highlight an important issue of multimodal modeling: the mere presence of information detectable by a probing classifier is not a guarantee that the information is available in a cross-modal setup.

arXiv information

Authors	Philipp J. Rösch, Jindřich Libovický
Published	2023-05-17 08:38:59+00:00
arXiv site	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google
