DALG: Deep Attentive Local and Global Modeling for Image Retrieval

Summary

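Deeply learned representations achieve excellent image retrieval performance in a retrieve-then-rerank manner. Recent state-of-the-art single-stage models that heuristically fuse local and global features can also strike a good trade-off between efficiency and effectiveness. However, because they rely on multi-scale inference, their efficiency remains limited. In this paper, we follow the single-stage approach and achieve a better balance between complexity and effectiveness by successfully removing multi-scale testing. To this end, we abandon the widely used convolutional network, whose ability to explore diverse visual patterns is limited, and instead adopt a fully attention-based framework for robust representation learning, motivated by the success of the Transformer. Beyond applying a Transformer to global feature extraction, we design a local branch composed of window-based multi-head attention and spatial attention to fully exploit local image patterns. Furthermore, instead of the heuristic fusion used in prior work, we propose combining hierarchical local and global features through a cross-attention module. Extensive experiments show that our Deep Attentive Local and Global modeling framework (DALG) substantially improves efficiency while remaining competitive with state-of-the-art methods.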
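The summary names the local branch's ingredients (window-based multi-head attention followed by spatial attention) but not their internals. The following is a minimal pure-Python sketch under stated assumptions: non-overlapping windows, identity Q/K/V projections in place of learned weights, and a hypothetical norm-based spatial attention score. The function names (`window_partition`, `window_multi_head_attention`, `spatial_attention_pool`, `local_branch`) are illustrative and do not come from the DALG code.

```python
import math


def softmax(xs):
    # Numerically stable softmax over a list of scores.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def window_partition(grid, win):
    """Split an H x W grid of feature vectors into non-overlapping win x win
    windows; each window is a flat, row-major list of token vectors."""
    h, w = len(grid), len(grid[0])
    assert h % win == 0 and w % win == 0, "grid must be divisible by the window size"
    windows = []
    for i in range(0, h, win):
        for j in range(0, w, win):
            windows.append([grid[i + di][j + dj] for di in range(win) for dj in range(win)])
    return windows


def window_multi_head_attention(tokens, num_heads):
    """Multi-head self-attention restricted to one window. Channels are split
    evenly across heads; Q, K and V are the raw features (identity projections,
    an assumption standing in for learned weight matrices)."""
    dim = len(tokens[0])
    assert dim % num_heads == 0, "channels must divide evenly into heads"
    hd = dim // num_heads
    out = [[0.0] * dim for _ in tokens]
    for h in range(num_heads):
        lo = h * hd
        heads = [t[lo:lo + hd] for t in tokens]
        for qi, q in enumerate(heads):
            scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(hd) for k in heads]
            weights = softmax(scores)
            for c in range(hd):
                out[qi][lo + c] = sum(wt * v[c] for wt, v in zip(weights, heads))
    return out


def spatial_attention_pool(tokens):
    """Hypothetical spatial attention: score each position by its feature norm,
    softmax over positions, and return the weighted sum as one descriptor."""
    norms = [math.sqrt(sum(x * x for x in t)) for t in tokens]
    weights = softmax(norms)
    dim = len(tokens[0])
    return [sum(wt * t[c] for wt, t in zip(weights, tokens)) for c in range(dim)]


def local_branch(grid, win=2, num_heads=2):
    # Attend within each window, gather all tokens, then pool spatially.
    attended = []
    for window in window_partition(grid, win):
        attended.extend(window_multi_head_attention(window, num_heads))
    return spatial_attention_pool(attended)


# Usage: a 4 x 4 grid of 4-dimensional toy features.
grid = [[[float(i + j), float(i - j), 1.0, 0.5 * i] for j in range(4)] for i in range(4)]
descriptor = local_branch(grid, win=2, num_heads=2)
print(len(descriptor))  # prints 4
```

Restricting attention to each window keeps the cost proportional to the window size rather than the full image, which is the usual motivation for window-based attention; in a trained model the projections and the spatial attention scorer would be learned rather than fixed.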

Summary (original)

Deeply learned representations have achieved superior image retrieval performance in a retrieve-then-rerank manner. A recent state-of-the-art single-stage model, which heuristically fuses local and global features, achieves a promising trade-off between efficiency and effectiveness. However, we notice that the efficiency of existing solutions is still restricted because of their multi-scale inference paradigm. In this paper, we follow the single-stage art and obtain a further complexity-effectiveness balance by successfully getting rid of multi-scale testing. To achieve this goal, we abandon the widely used convolutional network, given its limitations in exploring diverse visual patterns, and resort to a fully attention-based framework for robust representation learning, motivated by the success of the Transformer. Besides applying a Transformer for global feature extraction, we devise a local branch composed of window-based multi-head attention and spatial attention to fully exploit local image patterns. Furthermore, we propose to combine the hierarchical local and global features via a cross-attention module, instead of using heuristic fusion as previous art does. With our Deep Attentive Local and Global modeling framework (DALG), extensive experimental results show that efficiency can be significantly improved while maintaining results competitive with the state of the art.
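The abstract states that hierarchical local and global features are combined through a cross-attention module, but gives no formulation. As a minimal sketch, assuming the global descriptor is the single query, each level of local tokens supplies both keys and values with identity projections, and the attended features are added residually before L2 normalization, the fusion and the resulting single-stage ranking could look like this. All function names (`cross_attention`, `fuse_local_global`, `rank_database`) are hypothetical, not taken from the DALG code.

```python
import math


def softmax(xs):
    # Numerically stable softmax over a list of scores.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def l2_normalize(v):
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v] if n > 0 else list(v)


def cross_attention(query, keys, values):
    """Single-query scaled dot-product attention: the global feature attends
    over a set of local tokens (identity projections assumed)."""
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in keys]
    weights = softmax(scores)
    return [sum(w * v[c] for w, v in zip(weights, values)) for c in range(len(values[0]))]


def fuse_local_global(global_feat, local_levels):
    """local_levels holds one list of local token vectors per hierarchy level.
    The residual sum and the final L2 normalization are assumptions."""
    fused = list(global_feat)
    for tokens in local_levels:
        attended = cross_attention(global_feat, tokens, tokens)
        fused = [f + a for f, a in zip(fused, attended)]
    return l2_normalize(fused)


def rank_database(query_desc, db_descs):
    # Single-stage retrieval: one descriptor per image, ranked by cosine
    # similarity (a plain dot product, since descriptors are L2-normalized).
    sims = [sum(a * b for a, b in zip(query_desc, d)) for d in db_descs]
    return sorted(range(len(db_descs)), key=lambda i: -sims[i])


# Usage with toy 2-D features: one global vector, one level with one local token.
q = fuse_local_global([1.0, 0.0], [[[0.0, 1.0]]])
db = [l2_normalize([0.0, 1.0]), l2_normalize([1.0, 1.0])]
print(rank_database(q, db))  # [1, 0]
```

Because each image is encoded once at a single scale into one fused descriptor, retrieval reduces to a single nearest-neighbor ranking with no multi-scale test pass and no separate reranking stage, which is the efficiency argument the abstract makes.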

arXiv information

Authors	Yuxin Song, Ruolin Zhu, Min Yang, Dongliang He
Published	2022-07-01 09:32:15+00:00
arXiv page	arxiv_id(pdf)

Source, services used

arxiv.jp, DeepL
