Exploring Long-Sequence Masked Autoencoders

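Summary

Masked Autoencoding (MAE) has become an effective approach for pre-training representations across many domains.
Unlike the discrete tokens of natural language, the input to an image MAE is continuous and is subject to additional specifications.
We systematically examine each input specification at the pre-training stage and find that sequence length is a key axis for scaling MAE further.
Our study leads to a long-sequence version of MAE that changes the original recipe only minimally, simply by decoupling the mask size from the patch size.
For object detection and semantic segmentation, long-sequence MAE shows consistent gains in every experimental setup, with no extra computation cost during transfer.
Long-sequence pre-training turns out to be most beneficial for detection and segmentation, but we also achieve strong results on ImageNet-1K classification by keeping a standard image size and increasing only the sequence length.
We hope our findings offer new insights and avenues for scaling in computer vision.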

Summary (original)

Masked Autoencoding (MAE) has emerged as an effective approach for pre-training representations across multiple domains. In contrast to discrete tokens in natural languages, the input for image MAE is continuous and subject to additional specifications. We systematically study each input specification during the pre-training stage, and find sequence length is a key axis that further scales MAE. Our study leads to a long-sequence version of MAE with minimal changes to the original recipe, by just decoupling the mask size from the patch size. For object detection and semantic segmentation, our long-sequence MAE shows consistent gains across all the experimental setups without extra computation cost during the transfer. While long-sequence pre-training is discerned most beneficial for detection and segmentation, we also achieve strong results on ImageNet-1K classification by keeping a standard image size and only increasing the sequence length. We hope our findings can provide new insights and avenues for scaling in computer vision.
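
The abstract does not spell out how the mask size is decoupled from the patch size. The following is a minimal sketch of one plausible reading: mask decisions are sampled on a coarse grid of `mask_size` blocks and then copied to every smaller `patch_size` patch inside each block, so the sequence gets longer while the masked regions keep their pixel size. The function name `block_random_mask` and the numbers used below (448-pixel image, 16-pixel patches, 32-pixel mask blocks, 0.75 mask ratio) are illustrative assumptions, not values taken from this page.

```python
import random


def block_random_mask(image_size, patch_size, mask_size, mask_ratio, seed=None):
    """Return a square grid of booleans over patches; True means masked.

    Hypothetical helper: masking is decided per mask_size block and then
    shared by all patch_size patches that fall inside that block.
    """
    if image_size % mask_size != 0 or mask_size % patch_size != 0:
        raise ValueError(
            "image_size must be divisible by mask_size, "
            "and mask_size by patch_size"
        )
    rng = random.Random(seed)
    blocks_per_side = image_size // mask_size
    patches_per_block = mask_size // patch_size
    num_blocks = blocks_per_side ** 2
    num_masked = round(mask_ratio * num_blocks)

    # Sample which coarse blocks are masked, without replacement.
    masked_blocks = set(rng.sample(range(num_blocks), num_masked))

    # Expand each block-level decision to the patches it covers.
    grid = blocks_per_side * patches_per_block
    mask = []
    for row in range(grid):
        block_row = row // patches_per_block
        mask.append([
            (block_row * blocks_per_side + col // patches_per_block)
            in masked_blocks
            for col in range(grid)
        ])
    return mask


# Assumed example: 448-pixel image, 16-pixel patches, 32-pixel mask blocks.
long_mask = block_random_mask(448, 16, 32, mask_ratio=0.75, seed=0)
sequence_length = len(long_mask) * len(long_mask[0])
```

Setting `mask_size` equal to `patch_size` reduces this sketch to ordinary per-patch random masking, which is why the change can be described as minimal.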

arXiv information

Authors	Ronghang Hu, Shoubhik Debnath, Saining Xie, Xinlei Chen
Published	2022-10-13 17:50:23+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
