Mobile V-MoEs: Scaling Down Vision Transformers via Sparse Mixture-of-Experts

Abstract

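Sparse Mixture-of-Experts models (MoEs) have recently gained popularity because they decouple model size from inference efficiency by activating only a small subset of the model parameters for any given input token.
As a result, sparse MoEs have enabled unprecedented scalability and led to major successes in domains such as natural language processing and computer vision.
In this work, we instead explore using sparse MoEs to scale down Vision Transformers (ViTs), making them more attractive for resource-constrained vision applications.
To this end, we propose a simplified, mobile-friendly MoE design in which entire images, rather than individual patches, are routed to the experts.
We also propose a stable MoE training procedure that uses super-class information to guide the router.
We show empirically that our sparse Mobile Vision MoEs (V-MoEs) achieve a better trade-off between performance and efficiency than the corresponding dense ViTs.
For example, for the ViT-Tiny model, our Mobile V-MoE outperforms its dense counterpart by 3.39% on ImageNet-1k.
For an even smaller ViT variant with an inference cost of only 54M FLOPs, our MoE achieves an improvement of 4.66%.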
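
The per-image routing idea can be illustrated with a minimal sketch in plain Python. It is not the authors' implementation. Everything here is an assumption made for illustration: the function names (`route_image`, `moe_layer`), mean-pooling the patch tokens to get an image representation, a single linear router, renormalizing the gates over the selected experts, and the toy experts. The only point taken from the abstract is that the router makes one decision per image, so every patch of that image goes through the same experts.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(matrix, vector):
    """Multiply a list-of-rows matrix by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def route_image(patch_tokens, router_weights, k=1):
    """Choose the top-k experts once for the whole image.

    Assumption: the image is summarized by mean-pooling its patch tokens,
    and the router is a single linear layer followed by a softmax.
    """
    dim = len(patch_tokens[0])
    count = len(patch_tokens)
    image_repr = [sum(tok[d] for tok in patch_tokens) / count for d in range(dim)]
    probs = softmax(matvec(router_weights, image_repr))
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return ranked[:k], probs


def moe_layer(patch_tokens, router_weights, experts, k=1):
    """Apply the same selected experts to every patch of one image."""
    top, probs = route_image(patch_tokens, router_weights, k)
    norm = sum(probs[i] for i in top)  # assumption: renormalize selected gates
    outputs = []
    for tok in patch_tokens:
        mixed = [0.0] * len(tok)
        for i in top:
            gate = probs[i] / norm
            expert_out = experts[i](tok)
            mixed = [m + gate * v for m, v in zip(mixed, expert_out)]
        outputs.append(mixed)
    return outputs, top


if __name__ == "__main__":
    # Hypothetical toy experts: each one just scales its input.
    experts = [lambda t, s=s: [s * v for v in t] for s in (1.0, 2.0, 3.0)]
    router_weights = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
    patches = [[1.0, 3.0], [3.0, 5.0]]
    outputs, chosen = moe_layer(patches, router_weights, experts, k=1)
    print("experts used for this image:", chosen)
    print("outputs:", outputs)
```

Because the routing decision is made once per image, only the `k` selected experts need to be evaluated for that image, whereas per-patch routing could send different patches of the same image to different experts.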

Abstract (original)

Sparse Mixture-of-Experts models (MoEs) have recently gained popularity due to their ability to decouple model size from inference efficiency by only activating a small subset of the model parameters for any given input token. As such, sparse MoEs have enabled unprecedented scalability, resulting in tremendous successes across domains such as natural language processing and computer vision. In this work, we instead explore the use of sparse MoEs to scale-down Vision Transformers (ViTs) to make them more attractive for resource-constrained vision applications. To this end, we propose a simplified and mobile-friendly MoE design where entire images rather than individual patches are routed to the experts. We also propose a stable MoE training procedure that uses super-class information to guide the router. We empirically show that our sparse Mobile Vision MoEs (V-MoEs) can achieve a better trade-off between performance and efficiency than the corresponding dense ViTs. For example, for the ViT-Tiny model, our Mobile V-MoE outperforms its dense counterpart by 3.39% on ImageNet-1k. For an even smaller ViT variant with only 54M FLOPs inference cost, our MoE achieves an improvement of 4.66%.

arXiv information

Authors	Erik Daxberger, Floris Weers, Bowen Zhang, Tom Gunter, Ruoming Pang, Marcin Eichner, Michael Emmersberger, Yinfei Yang, Alexander Toshev, Xianzhi Du
Published	2023-09-08 14:24:10+00:00
arXiv page	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google
