Query-guided Attention in Vision Transformers for Localizing Objects Using a Single Sketch

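Summary

In this work, we study sketch-based object localization in natural images: given a rough hand-drawn sketch of an object, the goal is to localize every instance of that object in a target image.
The problem is hard because hand-drawn sketches are abstract, vary widely in style and quality, and are separated from natural images by a large domain gap.
To mitigate these challenges, existing work has proposed attention-based frameworks that incorporate query information into the image features.
In those approaches, however, the query features are incorporated only after the image features have been learned independently, which leads to inadequate alignment.
In contrast, we propose a sketch-guided vision transformer encoder that applies cross-attention after each block of the transformer-based image encoder, learning query-conditioned image features that align more strongly with the query sketch.
In addition, at the output of the decoder, the object and sketch features are refined so that the representations of relevant objects move closer to the sketch query, which improves localization.
Because the target image features learned this way are query-aware, the proposed model also generalizes to object categories not seen during training.
Our localization framework can also use multiple sketch queries through a novel trainable sketch fusion strategy.
The model is evaluated on images from the public object detection benchmark MS-COCO, using sketch queries from the QuickDraw! and Sketchy datasets.
Compared with existing localization methods, the proposed approach improves mAP for seen objects by $6.6\%$ and $8.0\%$ with sketch queries from QuickDraw! and Sketchy, respectively, and improves AP@50 by $12.2\%$ for large objects that are 'unseen' during training.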
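
The encoder design described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names (`attention`, `sketch_guided_encoder`, `make_blocks`), the single-head attention, the random weights, and the token shapes are all assumptions made for illustration, and layer normalization, MLP sublayers, and positional embeddings are omitted. It shows only the ordering the abstract states: cross-attention to the sketch tokens follows each image encoder block, so the image features are conditioned on the query at every depth rather than after encoding has finished.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def attention(q_in, kv_in, w_q, w_k, w_v):
    """Single-head scaled dot-product attention.

    Queries come from q_in; keys and values come from kv_in.
    Passing the same array twice gives self-attention.
    """
    q, k, v = q_in @ w_q, kv_in @ w_k, kv_in @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores) @ v


def make_blocks(num_blocks, dim, seed=0):
    """Random (self-attention, cross-attention) weights; hypothetical stand-ins for trained ones."""
    rng = np.random.default_rng(seed)

    def weights():
        return tuple(rng.normal(scale=dim ** -0.5, size=(dim, dim)) for _ in range(3))

    return [(weights(), weights()) for _ in range(num_blocks)]


def sketch_guided_encoder(image_tokens, sketch_tokens, blocks):
    """Apply cross-attention to the sketch after every image encoder block."""
    x = image_tokens
    for self_w, cross_w in blocks:
        # Image encoder block, reduced here to residual self-attention.
        x = x + attention(x, x, *self_w)
        # Query guidance: image tokens attend to the sketch tokens.
        x = x + attention(x, sketch_tokens, *cross_w)
    return x


rng = np.random.default_rng(1)
image_tokens = rng.normal(size=(16, 8))   # 16 image patches, 8-dim features (assumed sizes)
sketch_tokens = rng.normal(size=(4, 8))   # 4 sketch patches, same feature size
blocks = make_blocks(num_blocks=3, dim=8)

features = sketch_guided_encoder(image_tokens, sketch_tokens, blocks)
print(features.shape)  # (16, 8)
```

Because the sketch enters inside the loop, changing the query changes the image features produced by every block, which is the property the abstract contrasts with fusing the query only after the image features are already fixed.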

Abstract (original)

In this work, we investigate the problem of sketch-based object localization on natural images, where given a crude hand-drawn sketch of an object, the goal is to localize all the instances of the same object on the target image. This problem proves difficult due to the abstract nature of hand-drawn sketches, variations in the style and quality of sketches, and the large domain gap existing between the sketches and the natural images. To mitigate these challenges, existing works proposed attention-based frameworks to incorporate query information into the image features. However, in these works, the query features are incorporated after the image features have already been independently learned, leading to inadequate alignment. In contrast, we propose a sketch-guided vision transformer encoder that uses cross-attention after each block of the transformer-based image encoder to learn query-conditioned image features leading to stronger alignment with the query sketch. Further, at the output of the decoder, the object and the sketch features are refined to bring the representation of relevant objects closer to the sketch query and thereby improve the localization. The proposed model also generalizes to the object categories not seen during training, as the target image features learned by our method are query-aware. Our localization framework can also utilize multiple sketch queries via a trainable novel sketch fusion strategy. The model is evaluated on the images from the public object detection benchmark, namely MS-COCO, using the sketch queries from QuickDraw! and Sketchy datasets. Compared with existing localization methods, the proposed approach gives a $6.6\%$ and $8.0\%$ improvement in mAP for seen objects using sketch queries from QuickDraw! and Sketchy datasets, respectively, and a $12.2\%$ improvement in AP@50 for large objects that are 'unseen' during training.
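
The AP@50 figure counts a predicted box as correct when its intersection over union (IoU) with a ground-truth box is at least 0.5. The following sketch computes IoU for axis-aligned boxes. The `(x1, y1, x2, y2)` box format and the helper names `iou` and `is_match_at_50` are assumptions for illustration, and the confidence ranking and precision-recall integration needed for a full AP score are left out.

```python
def iou(box_a, box_b):
    """Intersection over union of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Width and height of the overlap, clamped at zero when the boxes are disjoint.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


def is_match_at_50(pred_box, gt_box):
    """True when the prediction would count as correct at the 0.5 IoU threshold."""
    return iou(pred_box, gt_box) >= 0.5


ground_truth = (0, 0, 10, 10)
print(is_match_at_50((2, 0, 12, 10), ground_truth))  # True  (IoU = 80 / 120)
print(is_match_at_50((5, 0, 15, 10), ground_truth))  # False (IoU = 50 / 150)
```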

arXiv information

Authors	Aditay Tripathi, Anand Mishra, Anirban Chakraborty
Published	2023-03-15 17:26:17+00:00
arXiv page	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google
