Retrieval-based Knowledge Augmented Vision Language Pre-training

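Summary

Title: Retrieval-based Knowledge Augmented Vision Language Pre-training

Summary:
- With recent progress in large-scale vision and language representation learning, vision-language pre-training (VLP) models have achieved promising improvements on a range of multimodal downstream tasks.
- These pre-trained models still do not exploit world knowledge, which is only implicit in multimodal data yet carries abundant, complementary information.
- This work proposes the REtrieval-based knowledge Augmented Vision Language Pre-training model (REAVL), which retrieves world knowledge from knowledge graphs (KGs) and incorporates it into vision-language pre-training.
- REAVL has two core components: a knowledge retriever that retrieves knowledge given multimodal data, and a knowledge-augmented model that fuses multimodal data with that knowledge.
- By unifying four knowledge-aware self-supervised tasks, REAVL promotes the mutual integration of multimodal data and knowledge: it fuses explicit knowledge with vision-language pairs for masked multimodal data modeling and KG relational reasoning.
- In experiments, REAVL uniformly achieves new state-of-the-art performance on knowledge-based vision-language understanding and multimodal entity linking tasks, and competitive results on general vision-language tasks while using only 0.2% of the pre-training data of the best models.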
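
The retrieve-then-fuse idea behind the two components can be illustrated with a toy sketch. This is not REAVL's implementation: the abstract does not describe the retriever or the fusion model in detail, and in the paper both are learned neural components operating on image-text pairs. Everything below is an assumption made for illustration: the knowledge graph `TOY_KG`, the bag-of-words `embed` function standing in for an encoder, the text-only query, and the `[KNOWLEDGE]` separator used by `fuse`.

```python
import math

# Toy knowledge graph as (head, relation, tail) triples. Purely illustrative.
TOY_KG = [
    ("Eiffel Tower", "located_in", "Paris"),
    ("Paris", "capital_of", "France"),
    ("Golden Gate Bridge", "located_in", "San Francisco"),
]


def embed(text):
    """Hypothetical stand-in for a learned encoder: bag-of-words counts."""
    counts = {}
    for token in text.lower().replace("_", " ").split():
        counts[token] = counts.get(token, 0) + 1
    return counts


def cosine(a, b):
    """Cosine similarity between two sparse count vectors (dicts)."""
    dot = sum(value * b.get(token, 0) for token, value in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve(query_text, kg, top_k=1):
    """Return up to top_k triples most similar to the query, skipping zero scores."""
    query_vec = embed(query_text)
    scored = [(cosine(query_vec, embed(" ".join(triple))), triple) for triple in kg]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [triple for score, triple in scored[:top_k] if score > 0]


def fuse(caption, triples):
    """Naive fusion: append retrieved facts to the caption as plain text."""
    facts = "; ".join(f"{h} {r} {t}" for h, r, t in triples)
    return f"{caption} [KNOWLEDGE] {facts}" if facts else caption


caption = "a photo of the eiffel tower at night"
retrieved = retrieve(caption, TOY_KG, top_k=1)
fused = fuse(caption, retrieved)
print(fused)
```

In a real system the caption-only query would be replaced by a joint image-text representation, and the string concatenation in `fuse` by a model that attends over both the multimodal input and the retrieved knowledge; the sketch only shows how the two stages hand data to each other.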

Summary (original)

With recent progress in large-scale vision and language representation learning, Vision Language Pretraining (VLP) models have achieved promising improvements on various multi-modal downstream tasks. Albeit powerful, these pre-training models still do not take advantage of world knowledge, which is implicit in multi-modal data but comprises abundant and complementary information. In this work, we propose a REtrieval-based knowledge Augmented Vision Language Pre-training model (REAVL), which retrieves world knowledge from knowledge graphs (KGs) and incorporates them in vision-language pre-training. REAVL has two core components: a knowledge retriever that retrieves knowledge given multi-modal data, and a knowledge-augmented model that fuses multi-modal data and knowledge. By novelly unifying four knowledge-aware self-supervised tasks, REAVL promotes the mutual integration of multi-modal data and knowledge by fusing explicit knowledge with vision-language pairs for masked multi-modal data modeling and KG relational reasoning. Empirical experiments show that REAVL achieves new state-of-the-art performance uniformly on knowledge-based vision-language understanding and multimodal entity linking tasks, and competitive results on general vision-language tasks while only using 0.2% pre-training data of the best models.

arXiv information

Authors	Jiahua Rao, Zifei Shan, Longpo Liu, Yao Zhou, Yuedong Yang
Published	2023-04-27 02:23:47+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, OpenAI
