GPViT: A High Resolution Non-Hierarchical Vision Transformer with Group Propagation

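Summary

We present the Group Propagation Vision Transformer (GPViT), a new non-hierarchical (i.e., non-pyramidal) transformer model designed for general visual recognition with high-resolution features.
High-resolution features (or tokens) are a natural fit for tasks that require perceiving fine-grained details, such as detection and segmentation, but exchanging global information among them is expensive in memory and computation because the cost of self-attention grows quadratically with the number of tokens.
We provide the Group Propagation Block (GP Block), a highly efficient alternative for exchanging global information.
In each GP Block, features are first grouped by a fixed number of learnable group tokens.
Group Propagation is then performed, in which global information is exchanged among the grouped features.
Finally, the global information in the updated grouped features is returned to the image features through a transformer decoder.
We evaluate GPViT on a variety of visual recognition tasks, including image classification, semantic segmentation, object detection, and instance segmentation.
Our method achieves significant performance gains over previous work across all tasks, especially those that require high-resolution outputs.
Code and pre-trained models are available at https://github.com/ChenhongyiYang/GPViT.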
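
The three steps of a GP Block described above (grouping, group propagation, returning information to the image features) can be sketched in plain Python. This is only an illustrative sketch under several assumptions, not the authors' implementation: attention is single-head with no learned projections, the propagation step is approximated by self-attention among the grouped features, the group tokens are random rather than learned, and the names `cross_attention` and `gp_block` as well as the sizes `N`, `G`, and `C` are hypothetical.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    return [list(col) for col in zip(*m)]


def softmax_rows(m):
    """Numerically stable softmax applied to each row."""
    out = []
    for row in m:
        mx = max(row)
        exps = [math.exp(v - mx) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out


def cross_attention(queries, keys_values):
    """Single-head attention without learned projections (a simplification)."""
    dim = len(queries[0])
    scores = matmul(queries, transpose(keys_values))
    scores = [[s / math.sqrt(dim) for s in row] for row in scores]
    return matmul(softmax_rows(scores), keys_values)


def gp_block(image_feats, group_tokens):
    # 1. Grouping: each group token attends to all N image features (G x N scores).
    grouped = cross_attention(group_tokens, image_feats)
    # 2. Group propagation: exchange information among the G grouped features.
    #    Assumption: self-attention plus a residual stands in for this step.
    mixed = cross_attention(grouped, grouped)
    grouped = [[g + m for g, m in zip(g_row, m_row)]
               for g_row, m_row in zip(grouped, mixed)]
    # 3. Ungrouping: each image feature queries the updated groups (N x G scores),
    #    and the result is added back to the image features.
    update = cross_attention(image_feats, grouped)
    return [[f + u for f, u in zip(f_row, u_row)]
            for f_row, u_row in zip(image_feats, update)]


random.seed(0)
N, G, C = 64, 4, 8  # hypothetical sizes: image tokens, group tokens, channels
image_feats = [[random.gauss(0, 1) for _ in range(C)] for _ in range(N)]
group_tokens = [[random.gauss(0, 1) for _ in range(C)] for _ in range(G)]  # random stand-ins for learned tokens

out = gp_block(image_feats, group_tokens)
print(len(out), len(out[0]))  # 64 8: the token count and channel width are preserved
```

In this sketch no step builds an N x N score matrix; the largest ones are G x N and N x G, which is where the savings over full self-attention would come from when G is much smaller than N.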

Abstract (original)

We present the Group Propagation Vision Transformer (GPViT): a novel nonhierarchical (i.e. non-pyramidal) transformer model designed for general visual recognition with high-resolution features. High-resolution features (or tokens) are a natural fit for tasks that involve perceiving fine-grained details such as detection and segmentation, but exchanging global information between these features is expensive in memory and computation because of the way self-attention scales. We provide a highly efficient alternative Group Propagation Block (GP Block) to exchange global information. In each GP Block, features are first grouped together by a fixed number of learnable group tokens; we then perform Group Propagation where global information is exchanged between the grouped features; finally, global information in the updated grouped features is returned back to the image features through a transformer decoder. We evaluate GPViT on a variety of visual recognition tasks including image classification, semantic segmentation, object detection, and instance segmentation. Our method achieves significant performance gains over previous works across all tasks, especially on tasks that require high-resolution outputs, for example, our GPViT-L3 outperforms Swin Transformer-B by 2.0 mIoU on ADE20K semantic segmentation with only half as many parameters. Code and pre-trained models are available at https://github.com/ChenhongyiYang/GPViT .
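
To make the scaling argument concrete, the following rough calculation compares the number of attention scores computed by full self-attention over N image tokens with a grouped scheme that only attends between image tokens and G group tokens. The token counts, the value of G, and the helper names are hypothetical choices for illustration, and the count for the grouped scheme assumes one grouping step, one exchange among groups, and one ungrouping step; the figures are not taken from the paper.

```python
def full_attention_scores(num_tokens):
    """Every token attends to every token: N * N scores."""
    return num_tokens * num_tokens


def grouped_attention_scores(num_tokens, num_groups):
    """Grouping (G * N) + exchange among groups (G * G) + ungrouping (N * G)."""
    return 2 * num_tokens * num_groups + num_groups * num_groups


num_groups = 64  # hypothetical number of group tokens
for side in (28, 56):  # hypothetical feature-map side lengths
    n = side * side
    full = full_attention_scores(n)
    grouped = grouped_attention_scores(n, num_groups)
    print(f"N={n}: full={full}, grouped={grouped}, ratio={full / grouped:.2f}")
```

Under these assumptions, quadrupling the number of tokens multiplies the full self-attention count by 16, while the grouped count grows roughly linearly in N for a fixed G.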

arXiv information

Authors	Chenhongyi Yang, Jiarui Xu, Shalini De Mello, Elliot J. Crowley, Xiaolong Wang
Published	2022-12-13 18:26:00+00:00
arXiv site	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google
