Scaling Vision Transformers to 22 Billion Parameters

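Summary

Scaling Transformers has driven breakthrough capabilities in language models.
The largest large language models (LLMs) now contain upwards of 100B parameters.
Vision Transformers (ViT) brought the same architecture to image and video modelling, but they have not yet been scaled successfully to nearly the same degree.
The largest dense ViT contains 4B parameters (Chen et al., 2022).
The paper presents a recipe for highly efficient and stable training of a 22B-parameter ViT (ViT-22B) and runs a wide variety of experiments on the resulting model.
When evaluated on downstream tasks (often with a lightweight linear model on frozen features), ViT-22B shows performance that increases with scale.
The authors also observe other interesting benefits of scale, including an improved tradeoff between fairness and performance, state-of-the-art alignment with human visual perception in terms of shape/texture bias, and improved robustness.
ViT-22B demonstrates the potential for "LLM-like" scaling in vision and provides key steps towards getting there.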

Abstract (original)

The scaling of Transformers has driven breakthrough capabilities for language models. At present, the largest large language models (LLMs) contain upwards of 100B parameters. Vision Transformers (ViT) have introduced the same architecture to image and video modelling, but these have not yet been successfully scaled to nearly the same degree; the largest dense ViT contains 4B parameters (Chen et al., 2022). We present a recipe for highly efficient and stable training of a 22B-parameter ViT (ViT-22B) and perform a wide variety of experiments on the resulting model. When evaluated on downstream tasks (often with a lightweight linear model on frozen features), ViT-22B demonstrates increasing performance with scale. We further observe other interesting benefits of scale, including an improved tradeoff between fairness and performance, state-of-the-art alignment to human visual perception in terms of shape/texture bias, and improved robustness. ViT-22B demonstrates the potential for ‘LLM-like’ scaling in vision, and provides key steps towards getting there.
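
The abstract mentions evaluating with "a lightweight linear model on frozen features". As a rough illustration of that general idea only, the sketch below fits a linear classifier on fixed feature vectors using closed-form ridge regression onto one-hot labels. Everything here is an assumption for illustration: the features are synthetic stand-ins rather than ViT-22B outputs, the function names `fit_linear_probe` and `predict` are hypothetical, and the paper's actual probing setup is not described in this abstract and may differ.

```python
import numpy as np


def fit_linear_probe(features, labels, num_classes, l2=1e-3):
    """Fit a linear classifier on frozen features (hypothetical helper).

    Uses closed-form ridge regression onto one-hot targets; the backbone
    that produced `features` is never updated.
    """
    # Append a constant column so the probe learns a bias term.
    x = np.hstack([features, np.ones((features.shape[0], 1))])
    y = np.eye(num_classes)[labels]  # one-hot targets
    gram = x.T @ x + l2 * np.eye(x.shape[1])
    return np.linalg.solve(gram, x.T @ y)


def predict(weights, features):
    """Return the predicted class index for each feature vector."""
    x = np.hstack([features, np.ones((features.shape[0], 1))])
    return np.argmax(x @ weights, axis=1)


# Synthetic stand-in for frozen backbone features (assumption: real
# features would come from a pretrained, frozen image encoder).
rng = np.random.default_rng(0)
num_classes, dim = 3, 16
centers = 5.0 * rng.normal(size=(num_classes, dim))

train_labels = rng.integers(0, num_classes, size=300)
train_feats = centers[train_labels] + rng.normal(size=(300, dim))
test_labels = rng.integers(0, num_classes, size=100)
test_feats = centers[test_labels] + rng.normal(size=(100, dim))

weights = fit_linear_probe(train_feats, train_labels, num_classes)
accuracy = float(np.mean(predict(weights, test_feats) == test_labels))
print(f"linear-probe accuracy on synthetic features: {accuracy:.2f}")
```

Because only the small weight matrix is fitted, a probe like this is cheap compared with fine-tuning the backbone, which is why frozen-feature evaluation is a common way to compare representations across model sizes.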

arXiv information

Authors	Mostafa Dehghani,Josip Djolonga,Basil Mustafa,Piotr Padlewski,Jonathan Heek,Justin Gilmer,Andreas Steiner,Mathilde Caron,Robert Geirhos,Ibrahim Alabdulmohsin,Rodolphe Jenatton,Lucas Beyer,Michael Tschannen,Anurag Arnab,Xiao Wang,Carlos Riquelme,Matthias Minderer,Joan Puigcerver,Utku Evci,Manoj Kumar,Sjoerd van Steenkiste,Gamaleldin F. Elsayed,Aravindh Mahendran,Fisher Yu,Avital Oliver,Fantine Huot,Jasmijn Bastings,Mark Patrick Collier,Alexey Gritsenko,Vighnesh Birodkar,Cristina Vasconcelos,Yi Tay,Thomas Mensink,Alexander Kolesnikov,Filip Pavetić,Dustin Tran,Thomas Kipf,Mario Lučić,Xiaohua Zhai,Daniel Keysers,Jeremiah Harmsen,Neil Houlsby
Published	2023-02-10 18:58:21+00:00
arXiv page	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
