Unpaired Referring Expression Grounding via Bidirectional Cross-Modal Matching

Abstract

Referring expression grounding is an important and challenging task in computer vision. To avoid the laborious annotation required by conventional referring grounding, unpaired referring grounding has been introduced, in which the training data contains only a set of images and a set of queries, with no correspondences between them. The few existing solutions to unpaired referring grounding are still preliminary, owing to the difficulty of learning image-text matching and the lack of top-down guidance with unpaired data. In this paper, we propose a novel bidirectional cross-modal matching (BiCM) framework to address these challenges. In particular, we design a query-aware attention map (QAM) module that introduces a top-down perspective by generating query-specific visual attention maps. We further introduce a cross-modal object matching (COM) module, which exploits the recently emerged pretrained image-text matching model, CLIP, to predict the target objects from a bottom-up perspective. The top-down and bottom-up predictions are then integrated via a similarity fusion (SF) module. We also propose a knowledge adaptation matching (KAM) module that leverages unpaired training data to adapt pretrained knowledge to the target dataset and task. Experiments show that our framework outperforms previous works by 6.55% and 9.94% on two popular grounding datasets.
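As a rough illustration of how a bottom-up matching score and a top-down attention score might be combined over candidate objects, the sketch below fuses the two with a weighted sum. It is not the paper's SF module: the abstract does not specify the fusion rule, so the softmax normalization, the weight `alpha`, the use of mean attention inside a box, and all function names are assumptions made for this example. The embeddings stand in for features that an image-text model such as CLIP could provide.

```python
import math

def cosine(u, v):
    """Cosine similarity between two non-zero vectors (bottom-up matching score)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def mean_attention(attn_map, box):
    """Mean attention inside a non-empty box (x0, y0, x1, y1), max bounds exclusive."""
    x0, y0, x1, y1 = box
    vals = [attn_map[y][x] for y in range(y0, y1) for x in range(x0, x1)]
    return sum(vals) / len(vals)

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def fuse_and_pick(region_embs, query_emb, attn_map, boxes, alpha=0.5):
    """Hypothetical fusion: weighted sum of normalized bottom-up and top-down scores."""
    bottom_up = softmax([cosine(r, query_emb) for r in region_embs])
    top_down = softmax([mean_attention(attn_map, b) for b in boxes])
    fused = [alpha * b + (1 - alpha) * t for b, t in zip(bottom_up, top_down)]
    best = max(range(len(fused)), key=fused.__getitem__)
    return best, fused

# Toy data (assumed): two candidate boxes, 2-D embeddings, a 4x4 attention map.
region_embs = [[1.0, 0.0], [0.0, 1.0]]
query_emb = [0.0, 1.0]
attn_map = [
    [0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 1.0],
    [0.0, 0.0, 1.0, 1.0],
]
boxes = [(0, 0, 2, 2), (2, 2, 4, 4)]

best, fused = fuse_and_pick(region_embs, query_emb, attn_map, boxes)
print(best, fused)
```

In this toy setup both cues favor the second box, so the fused score selects it. When the two cues disagree, `alpha` decides which one dominates.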

arXiv information

Authors	Hengcan Shi, Munawar Hayat, Jianfei Cai
Published	2022-06-05 17:29:28+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, DeepL
