Do Vision-and-Language Transformers Learn Grounded Predicate-Noun Dependencies?

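Summary

Recent advances in vision-and-language modeling have produced Transformer architectures that achieve remarkable performance on multimodal reasoning tasks.
However, the exact capabilities of these black-box models are still poorly understood.
Much previous work has focused on their ability to learn meaning at the word level, while their ability to track syntactic dependencies between words has received less attention.
We take a first step toward closing this gap by creating a new multimodal task that evaluates understanding of predicate-noun dependencies in a controlled setup.
Evaluating a range of state-of-the-art models, we find that performance on the task varies considerably: some models perform relatively well, while others perform at chance level.
To explain this variability, our analyses indicate that the quality of the pretraining data, and not only its quantity, is essential.
In addition, the best-performing models leverage fine-grained multimodal pretraining objectives on top of the standard image-text matching objective.
This study highlights that targeted and controlled evaluations are a crucial step toward precise and rigorous testing of the multimodal knowledge of vision-and-language models.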

Summary (original)

Recent advances in vision-and-language modeling have seen the development of Transformer architectures that achieve remarkable performance on multimodal reasoning tasks. Yet, the exact capabilities of these black-box models are still poorly understood. While much of previous work has focused on studying their ability to learn meaning at the word-level, their ability to track syntactic dependencies between words has received less attention. We take a first step in closing this gap by creating a new multimodal task targeted at evaluating understanding of predicate-noun dependencies in a controlled setup. We evaluate a range of state-of-the-art models and find that their performance on the task varies considerably, with some models performing relatively well and others at chance level. In an effort to explain this variability, our analyses indicate that the quality (and not only sheer quantity) of pretraining data is essential. Additionally, the best performing models leverage fine-grained multimodal pretraining objectives in addition to the standard image-text matching objectives. This study highlights that targeted and controlled evaluations are a crucial step for a precise and rigorous test of the multimodal knowledge of vision-and-language models.
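The abstract reports that some models perform "at chance level" but does not spell out the scoring protocol. As a minimal sketch, assume a forced-choice setup in which a model assigns an image-text matching score to a correct sentence and to a distractor sentence for the same image; that setup, the function names, and the example scores below are illustrative assumptions, not details taken from the paper. Under that assumption, accuracy and a one-sided binomial test against 50% chance could be computed like this:

```python
from math import comb


def forced_choice_accuracy(pairs):
    """Count trials where the correct sentence outscores the distractor.

    pairs: list of (score_correct, score_distractor) tuples (hypothetical
    image-text matching scores; the scoring model is not specified here).
    Returns (wins, number of trials, accuracy).
    """
    wins = sum(1 for correct, distractor in pairs if correct > distractor)
    return wins, len(pairs), wins / len(pairs)


def p_value_above_chance(wins, n, chance=0.5):
    """One-sided binomial test: P(X >= wins) for X ~ Binomial(n, chance)."""
    return sum(
        comb(n, k) * chance**k * (1 - chance) ** (n - k)
        for k in range(wins, n + 1)
    )


# Made-up scores for four trials, for illustration only.
scores = [(0.9, 0.2), (0.6, 0.7), (0.8, 0.1), (0.55, 0.5)]
wins, n, acc = forced_choice_accuracy(scores)
p_value = p_value_above_chance(wins, n)
print(f"accuracy={acc:.2f}, p-value vs. chance={p_value:.4f}")
```

With only a handful of trials, even a high accuracy is not distinguishable from chance, which is one reason a controlled evaluation needs enough test items per condition.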

arXiv information

Authors	Mitja Nikolaus, Emmanuelle Salin, Stephane Ayache, Abdellah Fourtassi, Benoit Favre
Published	2022-10-21 16:07:00+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
