Dynamic MDETR: A Dynamic Multimodal Transformer Decoder for Visual Grounding

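Abstract

Multimodal transformers offer high capacity and flexibility for aligning image and text in visual grounding.
However, encoder-only grounding frameworks (e.g., TransVG) are computationally heavy because their self-attention has quadratic time complexity.
To address this, we present a new multimodal transformer architecture, called Dynamic MDETR, which decouples the grounding process into an encoding phase and a decoding phase.
The key observation is that images contain high spatial redundancy.
We therefore devise a new dynamic multimodal transformer decoder that uses this sparsity as a prior to speed up visual grounding.
Specifically, the dynamic decoder consists of a 2D adaptive sampling module and a text-guided decoding module.
The sampling module selects informative patches by predicting offsets relative to a reference point, while the decoding module extracts information about the grounded object by performing cross-attention between image features and text features.
The two modules are stacked alternately to gradually bridge the modality gap and iteratively refine the reference point of the grounded object, ultimately producing the grounding result.
Extensive experiments on five benchmarks show that Dynamic MDETR achieves a competitive trade-off between computation and accuracy.
Notably, using only 9% of the feature points in the decoder reduces the GFLOPs of the multimodal transformer by about 44% while still achieving higher accuracy than the encoder-only counterpart.
In addition, to verify its generalization ability and to scale up Dynamic MDETR, we build the first one-stage CLIP-empowered visual grounding framework, which achieves state-of-the-art performance on these benchmarks.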
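
The alternation of sampling and decoding described above can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation. The function names, the toy feature map, the fixed offsets (a real model would predict them), the use of a single text-derived query attending over the sampled visual features, and the rule that moves the reference point to the attention-weighted mean of the sampled locations are all assumptions made for illustration.

```python
import math


def bilinear_sample(fmap, x, y):
    """Interpolate a grid of feature vectors fmap[row][col] at continuous (x, y)."""
    h, w = len(fmap), len(fmap[0])
    # Clamp so that offsets pointing outside the map stay well-defined.
    x = min(max(x, 0.0), w - 1.0)
    y = min(max(y, 0.0), h - 1.0)
    x0, y0 = int(x), int(y)
    x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)
    ax, ay = x - x0, y - y0
    return [
        (1 - ay) * ((1 - ax) * fmap[y0][x0][c] + ax * fmap[y0][x1][c])
        + ay * ((1 - ax) * fmap[y1][x0][c] + ax * fmap[y1][x1][c])
        for c in range(len(fmap[0][0]))
    ]


def cross_attend(query, feats):
    """Scaled dot-product attention of one query over feats (keys and values)."""
    scale = math.sqrt(len(query))
    scores = [sum(q * f for q, f in zip(query, feat)) / scale for feat in feats]
    top = max(scores)  # subtract the max for a numerically stable softmax
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    pooled = [
        sum(w * feat[c] for w, feat in zip(weights, feats))
        for c in range(len(query))
    ]
    return pooled, weights


def decoder_step(fmap, ref, offsets, text_query):
    """One sampling + decoding step; returns (new reference point, pooled feature)."""
    points = [(ref[0] + dx, ref[1] + dy) for dx, dy in offsets]
    feats = [bilinear_sample(fmap, px, py) for px, py in points]
    pooled, weights = cross_attend(text_query, feats)
    # Assumption: move the reference to the attention-weighted mean of the samples.
    new_ref = (
        sum(w * px for w, (px, _) in zip(weights, points)),
        sum(w * py for w, (_, py) in zip(weights, points)),
    )
    return new_ref, pooled


# Toy 5x5 map: only the cell at column 3, row 3 matches the text query.
fmap = [[[0.0, 0.0] for _ in range(5)] for _ in range(5)]
fmap[3][3] = [4.0, 0.0]
offsets = [(-1, -1), (1, -1), (-1, 1), (1, 1)]  # fixed here; predicted in a real model
ref = (2.0, 2.0)
for _ in range(2):
    ref, pooled = decoder_step(fmap, ref, offsets, text_query=[1.0, 0.0])
    print(ref, pooled)
```

Each step reads only four sampled locations instead of all 25 cells, which is the kind of saving the sparsity argument relies on. In this toy setup the reference point drifts from the centre of the map toward the matching cell.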

Abstract (original)

Multimodal transformer exhibits high capacity and flexibility to align image and text for visual grounding. However, the encoder-only grounding framework (e.g., TransVG) suffers from heavy computation due to the self-attention operation with quadratic time complexity. To address this issue, we present a new multimodal transformer architecture, coined as Dynamic MDETR, by decoupling the whole grounding process into encoding and decoding phases. The key observation is that there exists high spatial redundancy in images. Thus, we devise a new dynamic multimodal transformer decoder by exploiting this sparsity prior to speed up the visual grounding process. Specifically, our dynamic decoder is composed of a 2D adaptive sampling module and a text-guided decoding module. The sampling module aims to select these informative patches by predicting the offsets with respect to a reference point, while the decoding module works for extracting the grounded object information by performing cross attention between image features and text features. These two modules are stacked alternatively to gradually bridge the modality gap and iteratively refine the reference point of grounded object, eventually realizing the objective of visual grounding. Extensive experiments on five benchmarks demonstrate that our proposed Dynamic MDETR achieves competitive trade-offs between computation and accuracy. Notably, using only 9% feature points in the decoder, we can reduce ~44% GFLOPs of the multimodal transformer, but still get higher accuracy than the encoder-only counterpart. In addition, to verify its generalization ability and scale up our Dynamic MDETR, we build the first one-stage CLIP empowered visual grounding framework, and achieve the state-of-the-art performance on these benchmarks.
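
To make the quadratic-complexity argument concrete, the short sketch below counts query-key score entries for one self-attention layer. It is a counting exercise, not a GFLOPs measurement: the 20x20 feature map and both helper functions are hypothetical, and only the 9% ratio comes from the abstract.

```python
def self_attention_pairs(num_tokens):
    """Query-key score entries computed by one self-attention layer."""
    return num_tokens * num_tokens


def kept_points(num_points, keep_ratio):
    """Number of feature points left after sampling a fraction of them."""
    return round(num_points * keep_ratio)


num_points = 20 * 20  # hypothetical 20x20 feature map
sampled = kept_points(num_points, 0.09)  # the 9% ratio quoted in the abstract

print(self_attention_pairs(num_points))      # pairs over all points
print(self_attention_pairs(2 * num_points))  # doubling the tokens quadruples the pairs
print(sampled, self_attention_pairs(sampled))
```

The reported reduction of about 44% refers to the multimodal transformer as a whole, so it should not be expected to match this toy count, which looks at attention scores only.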

arXiv information

Authors	Fengyuan Shi, Ruopeng Gao, Weilin Huang, Limin Wang
Published	2022-09-28 09:43:02+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
