Teaching Structured Vision&Language Concepts to Vision&Language Models

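Summary

Vision and Language (VL) models have shown remarkable zero-shot performance on a variety of tasks.
However, some aspects of complex language understanding remain a challenge.
The paper introduces the collective notion of Structured Vision&Language Concepts (SVLC), which covers object attributes, relations, and states that are present in the text and visible in the image.
Recent studies have shown that even the best VL models struggle with SVLCs.
One possible fix is to collect dedicated datasets that teach each SVLC type, but this can be expensive and time-consuming.
Instead, the authors propose a more elegant data-driven approach that strengthens VL models' understanding of SVLCs by making more effective use of existing VL pre-training datasets, without requiring any additional data.
Automatic understanding of image structure is still largely unsolved, whereas language structure is much better modeled and understood, so it can be used effectively when teaching VL models.
The paper proposes various techniques, based on language structure understanding, for manipulating the textual part of off-the-shelf paired VL datasets.
VL models trained on the updated data show a significant improvement of up to 15% in SVLC understanding, with only a mild degradation in zero-shot capability, both when training from scratch and when fine-tuning a pre-trained model.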
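
The abstract does not spell out the individual text-manipulation techniques, so the following is only an illustrative sketch of what "manipulating the textual part" of an image-caption pair could look like. It assumes a simple rule-based swap of one attribute word (a color) to build a contrasting caption for the same image. The word list `COLOR_SWAPS` and the helpers `make_negative_caption` and `augment_pairs` are hypothetical names introduced here and are not taken from the paper.

```python
import re

# Hypothetical swap table for one concept type (color attributes).
# Assumption for illustration only; the paper's actual rules are not given in the abstract.
COLOR_SWAPS = {
    "red": "blue", "blue": "red",
    "black": "white", "white": "black",
    "green": "yellow", "yellow": "green",
}


def make_negative_caption(caption, swaps=COLOR_SWAPS):
    """Replace the first matching attribute word in the caption.

    Returns the altered caption, or None when no word from the table occurs.
    """
    # Match whole words only, so "bored" does not trigger the "red" rule.
    pattern = re.compile(
        r"\b(" + "|".join(map(re.escape, swaps)) + r")\b", re.IGNORECASE
    )
    match = pattern.search(caption)
    if match is None:
        return None
    replacement = swaps[match.group(0).lower()]
    return caption[:match.start()] + replacement + caption[match.end():]


def augment_pairs(pairs):
    """Attach an altered caption to each (image_id, caption) pair when one can be built."""
    return [
        {"image": image_id, "caption": caption,
         "negative": make_negative_caption(caption)}
        for image_id, caption in pairs
    ]
```

In a sketch like this the image is left untouched and only the text changes, which matches the abstract's point that no additional data is needed. How such altered captions enter the training objective is not described in the abstract and is left open here.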

Abstract (original)

Vision and Language (VL) models have demonstrated remarkable zero-shot performance in a variety of tasks. However, some aspects of complex language understanding still remain a challenge. We introduce the collective notion of Structured Vision&Language Concepts (SVLC) which includes object attributes, relations, and states which are present in the text and visible in the image. Recent studies have shown that even the best VL models struggle with SVLC. A possible way of fixing this issue is by collecting dedicated datasets for teaching each SVLC type, yet this might be expensive and time-consuming. Instead, we propose a more elegant data-driven approach for enhancing VL models’ understanding of SVLCs that makes more effective use of existing VL pre-training datasets and does not require any additional data. While automatic understanding of image structure still remains largely unsolved, language structure is much better modeled and understood, allowing for its effective utilization in teaching VL models. In this paper, we propose various techniques based on language structure understanding that can be used to manipulate the textual part of off-the-shelf paired VL datasets. VL models trained with the updated data exhibit a significant improvement of up to 15% in their SVLC understanding with only a mild degradation in their zero-shot capabilities both when training from scratch or fine-tuning a pre-trained model.

arXiv information

Authors	Sivan Doveh, Assaf Arbelle, Sivan Harary, Rameswar Panda, Roei Herzig, Eli Schwartz, Donghyun Kim, Raja Giryes, Rogerio Feris, Shimon Ullman, Leonid Karlinsky
Published	2022-11-21 18:54:10+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
