Bridge-Tower: Building Bridges Between Encoders in Vision-Language Representation Learning

Abstract

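Vision-language (VL) models with the Two-Tower architecture have dominated vision-language representation learning in recent years.
Current VL models either use lightweight uni-modal encoders and learn to extract, align, and fuse both modalities simultaneously in a cross-modal encoder, or feed the last-layer uni-modal features directly into the top cross-modal encoder, ignoring the semantic information at the different levels of the deep uni-modal encoders.
Both approaches may restrict vision-language representation learning and limit model performance.
This paper introduces multiple bridge layers that connect the top layers of the uni-modal encoders with each layer of the cross-modal encoder.
This enables comprehensive bottom-up interaction between visual and textual representations at different semantic levels, resulting in more effective cross-modal alignment and fusion.
The proposed Bridge-Tower, pre-trained on only $4$M images, achieves state-of-the-art performance on various downstream vision-language tasks.
On the VQAv2 test-std set, Bridge-Tower achieves an accuracy of $78.73\%$, outperforming the previous state-of-the-art METER model by $1.09\%$ with the same pre-training data and almost no additional parameters or computational cost.
Notably, when the model is scaled further, Bridge-Tower achieves an accuracy of $81.15\%$, surpassing models pre-trained on orders-of-magnitude larger datasets.
Code is available at https://github.com/microsoft/BridgeTower.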

Abstract (original)

Vision-Language (VL) models with the Two-Tower architecture have dominated visual-language representation learning in recent years. Current VL models either use lightweight uni-modal encoders and learn to extract, align and fuse both modalities simultaneously in a cross-modal encoder, or feed the last-layer uni-modal features directly into the top cross-modal encoder, ignoring the semantic information at the different levels in the deep uni-modal encoders. Both approaches possibly restrict vision-language representation learning and limit model performance. In this paper, we introduce multiple bridge layers that build a connection between the top layers of uni-modal encoders and each layer of the cross-modal encoder. This enables comprehensive bottom-up interactions between visual and textual representations at different semantic levels, resulting in more effective cross-modal alignment and fusion. Our proposed Bridge-Tower, pre-trained with only $4$M images, achieves state-of-the-art performance on various downstream vision-language tasks. On the VQAv2 test-std set, Bridge-Tower achieves an accuracy of $78.73\%$, outperforming the previous state-of-the-art METER model by $1.09\%$ with the same pre-training data and almost no additional parameters and computational cost. Notably, when further scaling the model, Bridge-Tower achieves an accuracy of $81.15\%$, surpassing models that are pre-trained on orders-of-magnitude larger datasets. Code is available at https://github.com/microsoft/BridgeTower.
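
The bridging idea described in the abstract can be illustrated with a small, dependency-free sketch. Everything in it is an assumption made for illustration: the abstract does not say how a bridge layer is computed, so the add-then-normalize form used here is only one plausible choice, and `toy_cross_modal_layer` is a hypothetical stand-in for a real cross-attention layer. The function names and the zero initial state are likewise invented for this example and are not taken from the authors' code.

```python
import math


def layer_norm(vec, eps=1e-5):
    """Normalize a vector to zero mean and (near) unit variance, without learned scale or shift."""
    mean = sum(vec) / len(vec)
    var = sum((v - mean) ** 2 for v in vec) / len(vec)
    return [(v - mean) / math.sqrt(var + eps) for v in vec]


def bridge(cross_state, unimodal_state):
    """Hypothetical bridge layer (assumption): add the two representations, then normalize."""
    return layer_norm([c + u for c, u in zip(cross_state, unimodal_state)])


def toy_cross_modal_layer(visual, textual):
    """Stand-in for a real cross-modal layer: each stream receives a share of the other."""
    new_visual = [v + 0.5 * t for v, t in zip(visual, textual)]
    new_textual = [t + 0.5 * v for v, t in zip(visual, textual)]
    return new_visual, new_textual


def bridge_tower_forward(visual_layers, textual_layers):
    """Run the toy cross-modal encoder.

    visual_layers and textual_layers hold the outputs of the top uni-modal
    layers, ordered bottom to top, one entry per cross-modal layer.
    """
    dim = len(visual_layers[0])
    visual = [0.0] * dim   # assumed empty initial cross-modal state
    textual = [0.0] * dim
    for vis_feat, txt_feat in zip(visual_layers, textual_layers):
        # Bridge: inject the matching uni-modal layer before each cross-modal layer.
        visual = bridge(visual, vis_feat)
        textual = bridge(textual, txt_feat)
        visual, textual = toy_cross_modal_layer(visual, textual)
    return visual, textual


if __name__ == "__main__":
    v_layers = [[1.0, 2.0, 3.0], [2.0, 0.0, 1.0]]
    t_layers = [[0.0, 1.0, 0.0], [1.0, 1.0, 3.0]]
    print(bridge_tower_forward(v_layers, t_layers))
```

The contrast with the second design criticized in the abstract is that the loop consumes a different uni-modal layer at every cross-modal step, instead of passing only the last-layer features into the cross-modal encoder.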

arXiv information

Authors	Xiao Xu, Chenfei Wu, Shachar Rosenman, Vasudev Lal, Nan Duan
Published	2022-06-17 09:42:35+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
