Patch n’ Pack: NaViT, a Vision Transformer for any Aspect Ratio and Resolution

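Summary

Resizing images to a fixed resolution before processing them with computer vision models is a ubiquitous and demonstrably suboptimal choice that has not yet been successfully challenged.
However, models such as the Vision Transformer (ViT) offer flexible sequence-based modeling and can therefore handle input sequences of varying length.
We take advantage of this with NaViT (Native Resolution ViT), which uses sequence packing during training to process inputs of arbitrary resolution and aspect ratio.
Alongside flexible model usage, we demonstrate improved training efficiency for large-scale supervised and contrastive image-text pretraining.
NaViT can be efficiently transferred to standard tasks such as image and video classification, object detection, and semantic segmentation, and leads to improved results on robustness and fairness benchmarks.
At inference time, the input resolution flexibility can be used to smoothly navigate the test-time cost-performance trade-off.
We believe that NaViT marks a departure from the standard, CNN-designed input and modeling pipeline used by most computer vision models, and represents a promising direction for ViTs.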
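
As a rough illustration of the sequence packing idea mentioned above, the following sketch counts patch tokens for images of different sizes, packs several images into fixed-length sequences, and builds a mask so that tokens attend only within their own image. The abstract does not describe the packing or masking procedure, so everything here is an assumption for illustration: the function names (num_patches, pack_greedy, attention_mask) are hypothetical, the patch size of 16 is an arbitrary choice, and first-fit packing stands in for whatever strategy the authors actually use.

```python
import math
from typing import List


def num_patches(height: int, width: int, patch: int = 16) -> int:
    """Number of tokens when an image is cut into patch x patch tiles.

    Assumption: partial tiles at the border are padded, hence the ceil.
    """
    return math.ceil(height / patch) * math.ceil(width / patch)


def pack_greedy(lengths: List[int], max_len: int) -> List[List[int]]:
    """First-fit packing (illustrative, not necessarily the paper's method).

    Returns one list per packed sequence holding the indices of the
    images placed in it.
    """
    bins: List[List[int]] = []
    free: List[int] = []  # remaining token budget of each packed sequence
    for idx, n in enumerate(lengths):
        if n > max_len:
            raise ValueError(f"image {idx} needs {n} tokens, limit is {max_len}")
        for b, room in enumerate(free):
            if n <= room:
                bins[b].append(idx)
                free[b] -= n
                break
        else:  # no existing sequence had room, so open a new one
            bins.append([idx])
            free.append(max_len - n)
    return bins


def attention_mask(segment_lengths: List[int], max_len: int) -> List[List[bool]]:
    """Block-diagonal mask for one packed sequence.

    Token i may attend to token j only if both come from the same image.
    Padding positions get id -1 and attend to nothing.
    """
    ids: List[int] = []
    for seg, n in enumerate(segment_lengths):
        ids.extend([seg] * n)
    ids.extend([-1] * (max_len - len(ids)))
    return [[a == b and a != -1 for b in ids] for a in ids]


# Four images at their native (height, width), no resizing to a square.
sizes = [(224, 224), (128, 256), (64, 64), (320, 160)]
lengths = [num_patches(h, w) for h, w in sizes]
packed = pack_greedy(lengths, max_len=256)
print(lengths)  # [196, 128, 16, 200]
print(packed)   # [[0, 2], [1], [3]]
```

The same token count also gives a feel for the inference-time trade-off: under these assumptions, halving both sides of an image cuts its token count to roughly a quarter, which is the kind of knob the abstract refers to when it talks about trading cost against performance.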

Summary (original)

The ubiquitous and demonstrably suboptimal choice of resizing images to a fixed resolution before processing them with computer vision models has not yet been successfully challenged. However, models such as the Vision Transformer (ViT) offer flexible sequence-based modeling, and hence varying input sequence lengths. We take advantage of this with NaViT (Native Resolution ViT) which uses sequence packing during training to process inputs of arbitrary resolutions and aspect ratios. Alongside flexible model usage, we demonstrate improved training efficiency for large-scale supervised and contrastive image-text pretraining. NaViT can be efficiently transferred to standard tasks such as image and video classification, object detection, and semantic segmentation and leads to improved results on robustness and fairness benchmarks. At inference time, the input resolution flexibility can be used to smoothly navigate the test-time cost-performance trade-off. We believe that NaViT marks a departure from the standard, CNN-designed, input and modelling pipeline used by most computer vision models, and represents a promising direction for ViTs.

arXiv information

Authors	Mostafa Dehghani, Basil Mustafa, Josip Djolonga, Jonathan Heek, Matthias Minderer, Mathilde Caron, Andreas Steiner, Joan Puigcerver, Robert Geirhos, Ibrahim Alabdulmohsin, Avital Oliver, Piotr Padlewski, Alexey Gritsenko, Mario Lučić, Neil Houlsby
Published	2023-07-12 17:01:03+00:00
arXiv page	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
