P2T: Pyramid Pooling Transformer for Scene Understanding

Summary

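Vision transformers have recently achieved great success, advancing the state of the art in a range of vision tasks. One of their most challenging problems is that the long sequence of image tokens makes computation expensive (quadratic complexity). A common remedy is to shorten the sequence with a single pooling operation. This paper considers how to improve existing vision transformers, in which the feature extracted by a single pooling operation appears to be weak. To this end, we note that pyramid pooling has proven effective in many vision tasks thanks to its strong context abstraction ability, yet it has not been explored in backbone network design. To fill this gap, we adapt pyramid pooling to multi-head self-attention (MHSA) in the vision transformer, which shortens the sequence and captures strong contextual features at the same time. Building on this pooling-based MHSA, we construct a general-purpose vision transformer backbone called Pyramid Pooling Transformer (P2T). Extensive experiments show that, when used as a backbone network, P2T is substantially better than earlier CNN- and transformer-based networks on vision tasks such as image classification, semantic segmentation, object detection, and instance segmentation. The code will be released at https://github.com/yuhuan-wu/P2T.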
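To make the sequence-length argument concrete, the following is a rough back-of-the-envelope sketch, not the authors' implementation. It assumes that queries keep the full token sequence while keys and values come from several pooled copies of the feature map that are concatenated together. The feature map size, channel width, and pooling ratios are illustrative values chosen here, and the function names are hypothetical.

```python
import math


def pooled_token_count(height, width, pool_ratios):
    """Tokens left after pooling an H x W map at each ratio and concatenating.

    Assumes each pooled map has ceil(H / r) x ceil(W / r) positions.
    """
    return sum(math.ceil(height / r) * math.ceil(width / r) for r in pool_ratios)


def attention_cost(num_queries, num_keys, dim):
    """Approximate multiply-adds for Q @ K^T plus attention @ V."""
    return 2 * num_queries * num_keys * dim


# Illustrative values only; none of these are taken from the abstract.
height, width, dim = 56, 56, 64
pool_ratios = (12, 16, 20, 24)

num_tokens = height * width
num_pooled = pooled_token_count(height, width, pool_ratios)

full_cost = attention_cost(num_tokens, num_tokens, dim)
pooled_cost = attention_cost(num_tokens, num_pooled, dim)

print(f"tokens: {num_tokens}, pooled key/value tokens: {num_pooled}")
print(f"attention cost reduction: about {full_cost / pooled_cost:.1f}x")
```

Under these assumptions the attention cost grows linearly with the number of pooled tokens instead of quadratically with the full sequence length, while the multiple pooling ratios give the keys and values context at several scales.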

Abstract (original)

Recently, the vision transformer has achieved great success by pushing the state-of-the-art of various vision tasks. One of the most challenging problems in the vision transformer is that the large sequence length of image tokens leads to high computational cost (quadratic complexity). A popular solution to this problem is to use a single pooling operation to reduce the sequence length. This paper considers how to improve existing vision transformers where the pooled feature extracted by a single pooling operation seems less powerful. To this end, we note that pyramid pooling has been demonstrated to be effective in various vision tasks owing to its powerful ability in context abstraction. However, pyramid pooling has not been explored in backbone network design. To bridge this gap, we propose to adapt pyramid pooling to Multi-Head Self-Attention (MHSA) in the vision transformer, simultaneously reducing the sequence length and capturing powerful contextual features. Plugged with our pooling-based MHSA, we build a universal vision transformer backbone, dubbed Pyramid Pooling Transformer (P2T). Extensive experiments demonstrate that, when applied P2T as the backbone network, it shows substantial superiority in various vision tasks such as image classification, semantic segmentation, object detection, and instance segmentation, compared to previous CNN- and transformer-based networks. The code will be released at https://github.com/yuhuan-wu/P2T.

arXiv information

Authors	Yu-Huan Wu, Yun Liu, Xin Zhan, Ming-Ming Cheng
Published	2022-08-05 07:54:44+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, DeepL
