An Efficient and Effective Transformer Decoder-Based Framework for Multi-Task Visual Grounding

Abstract

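Most advanced visual grounding methods use Transformers to fuse visual and linguistic features. With the self-attention mechanism of the Transformer Encoder, however, the computational cost grows quadratically, particularly for high-resolution images or long context sentences. This quadratic growth limits the application of visual grounding to more complex scenes, such as conversation-based reasoning segmentation, which involves long language expressions. To address this problem, this paper proposes an efficient and effective multi-task visual grounding (EEVG) framework based on a Transformer Decoder, which reduces cost on both the language and visual sides. On the language side, a Transformer Decoder fuses visual and linguistic features, with the linguistic features input as memory and the visual features as queries, so that fusion scales linearly with the length of the language expression. On the visual side, a parameter-free approach reduces computation by removing background visual tokens based on attention scores. A light mask head is then designed to predict segmentation masks directly from the remaining sparse feature map. Extensive results and ablation studies on benchmarks demonstrate the efficiency and effectiveness of the approach. Code is available at https://github.com/chenwei746/EEVG.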

Abstract (original)

Most advanced visual grounding methods rely on Transformers for visual-linguistic feature fusion. However, these Transformer-based approaches encounter a significant drawback: the computational costs escalate quadratically due to the self-attention mechanism in the Transformer Encoder, particularly when dealing with high-resolution images or long context sentences. This quadratic increase in computational burden restricts the applicability of visual grounding to more intricate scenes, such as conversation-based reasoning segmentation, which involves lengthy language expressions. In this paper, we propose an efficient and effective multi-task visual grounding (EEVG) framework based on Transformer Decoder to address this issue, which reduces the cost in both language and visual aspects. In the language aspect, we employ the Transformer Decoder to fuse visual and linguistic features, where linguistic features are input as memory and visual features as queries. This allows fusion to scale linearly with language expression length. In the visual aspect, we introduce a parameter-free approach to reduce computation by eliminating background visual tokens based on attention scores. We then design a light mask head to directly predict segmentation masks from the remaining sparse feature maps. Extensive results and ablation studies on benchmarks demonstrate the efficiency and effectiveness of our approach. Code is available in https://github.com/chenwei746/EEVG.
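
The scaling claim and the token-elimination idea in the abstract can be illustrated with a small sketch. This is not the authors' implementation (see the linked repository for that). It assumes a standard Transformer decoder layer, with self-attention among the visual queries and cross-attention to the language memory, and it counts pairwise attention scores as a rough proxy for cost. The function names, the `keep_ratio` parameter, and the top-k selection rule are hypothetical choices made for illustration; the abstract only states that background tokens are removed based on attention scores.

```python
from typing import List, Sequence


def encoder_fusion_pairs(num_visual: int, num_text: int) -> int:
    """Attention-score count when self-attention runs over the concatenated
    visual + language sequence: quadratic in the total length."""
    total = num_visual + num_text
    return total * total


def decoder_fusion_pairs(num_visual: int, num_text: int) -> int:
    """Attention-score count when visual tokens are queries and language
    tokens are memory (assumed standard decoder layer): self-attention among
    queries plus cross-attention to memory, so linear in num_text."""
    return num_visual * num_visual + num_visual * num_text


def prune_background_tokens(scores: Sequence[float], keep_ratio: float) -> List[int]:
    """Return the sorted indices of the visual tokens to keep.

    Hypothetical rule: keep the top `keep_ratio` fraction of tokens ranked by
    attention score. No learned parameters are involved.
    """
    if not 0.0 < keep_ratio <= 1.0:
        raise ValueError("keep_ratio must be in (0, 1]")
    num_keep = max(1, int(len(scores) * keep_ratio))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:num_keep])


if __name__ == "__main__":
    num_visual = 400  # e.g. a 20 x 20 feature map (illustrative value)
    for num_text in (20, 200):
        print(
            num_text,
            encoder_fusion_pairs(num_visual, num_text),
            decoder_fusion_pairs(num_visual, num_text),
        )

    # Made-up per-token attention scores for six visual tokens.
    scores = [0.05, 0.40, 0.10, 0.30, 0.02, 0.13]
    print(prune_background_tokens(scores, keep_ratio=0.5))
```

Under these assumptions, growing the expression from 20 to 200 tokens adds 400 x 180 scores in the decoder-style count, while the encoder-style count grows with the square of the combined length. Pruning visual tokens shrinks `num_visual`, which reduces both terms of the decoder-style count.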

arXiv information

Authors	Wei Chen, Long Chen, Yu Wu
Published	2024-08-02 09:01:05+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, DeepL
