Advancing Plain Vision Transformer Towards Remote Sensing Foundation Model

Summary

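Large-scale vision foundation models have made great progress on visual tasks for natural images. Thanks to their strong scalability and representation ability, vision transformers have become the primary choice.
However, large-scale models for remote sensing (RS) have not yet been sufficiently explored.
In this paper, we rely on plain vision transformers with about 100 million parameters, propose large vision models tailored to RS tasks, and make a first attempt to investigate how such large models perform.
To handle the large image sizes and arbitrarily oriented objects in RS images, we propose a new rotated varied-size window attention to replace the original full attention in transformers.
It significantly reduces computational cost and memory footprint while learning better object representations by extracting rich context from the generated diverse windows.
Experiments on detection tasks show the superiority of our model over all state-of-the-art models, achieving 81.24% mAP on the DOTA-V1.0 dataset.
The results of our models on downstream classification and segmentation tasks also show competitive performance compared with existing advanced methods.
Further experiments show the advantages of our models in terms of computational complexity and data efficiency in transfer.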

Abstract (original)

Large-scale vision foundation models have made significant progress in visual tasks on natural images, with vision transformers being the primary choice due to their good scalability and representation ability. However, large-scale models in remote sensing (RS) have not yet been sufficiently explored. In this paper, we resort to plain vision transformers with about 100 million parameters and make the first attempt to propose large vision models tailored to RS tasks and investigate how such large models perform. To handle the large sizes and objects of arbitrary orientations in RS images, we propose a new rotated varied-size window attention to replace the original full attention in transformers, which can significantly reduce the computational cost and memory footprint while learning better object representation by extracting rich context from the generated diverse windows. Experiments on detection tasks show the superiority of our model over all state-of-the-art models, achieving 81.24% mAP on the DOTA-V1.0 dataset. The results of our models on downstream classification and segmentation tasks also show competitive performance compared to existing advanced methods. Further experiments show the advantages of our models in terms of computational complexity and data efficiency in transferring.
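The abstract does not spell out how the rotated varied-size windows are built, so the following is only an illustrative sketch and not the authors' implementation. It assumes each window keeps a fixed number of sampling points (`base_size` x `base_size`) and that a window is described by a scale, an offset, and a rotation angle. The function names `rotated_window_coords` and `attention_pairs` are hypothetical. The second helper counts query-key pairs to show why restricting attention to windows is cheaper than full attention.

```python
import math


def rotated_window_coords(center, base_size, scale, offset, angle):
    """Sampling coordinates of a window that is scaled, shifted, and rotated.

    Hypothetical helper: the number of points stays base_size**2 regardless
    of the scale, so the attention cost per window does not grow with it.
    """
    cx, cy = center
    sx, sy = scale
    ox, oy = offset
    cos_a, sin_a = math.cos(angle), math.sin(angle)
    half = (base_size - 1) / 2
    coords = []
    for i in range(base_size):
        for j in range(base_size):
            # Local coordinates relative to the window center, after scaling.
            u = (j - half) * sx
            v = (i - half) * sy
            # Rotate about the center, then apply the offset.
            x = cx + ox + u * cos_a - v * sin_a
            y = cy + oy + u * sin_a + v * cos_a
            coords.append((x, y))
    return coords


def attention_pairs(num_tokens, window_tokens=None):
    """Number of query-key pairs for full attention or window attention."""
    if window_tokens is None:
        return num_tokens * num_tokens  # every token attends to every token
    return num_tokens * window_tokens   # every token attends within its window


# Assumed example: a 1024x1024 image with 16x16 patches gives 64*64 tokens.
tokens = 64 * 64
full = attention_pairs(tokens)
windowed = attention_pairs(tokens, window_tokens=8 * 8)
print(full, windowed, full // windowed)
```

Under these assumptions, scaling or rotating a window changes where keys and values are sampled but not how many are sampled, which is one plausible reading of how richer context can be gathered without returning to the quadratic cost of full attention.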

arXiv information

Authors	Di Wang, Qiming Zhang, Yufei Xu, Jing Zhang, Bo Du, Dacheng Tao, Liangpei Zhang
Published	2022-12-08 13:51:33+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google