Mixture of Nested Experts: Adaptive Processing of Visual Tokens

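Abstract

Visual media (images and videos) naturally contain a large amount of redundant information, which offers a significant opportunity to improve processing efficiency.
Vision Transformer (ViT) based models scale effectively to large data regimes, but they do not exploit this inherent redundancy, which leads to higher computational cost.
Mixture of Experts (MoE) networks demonstrate scalability while keeping the same inference-time cost, but they come with a larger parameter footprint.
We propose Mixture of Nested Experts (MoNE), which uses a nested structure for experts, in which the individual experts lie on an increasing compute-accuracy curve.
Given a compute budget, MoNE learns to dynamically choose tokens in priority order, so that redundant tokens are processed by cheaper nested experts.
With this framework, we achieve performance equivalent to the baseline models while reducing inference-time compute by more than a factor of two.
We validate the approach on standard image and video datasets (ImageNet-21K, Kinetics400, and Something-Something-v2).
We further highlight MoNE's adaptability by showing that a single trained model maintains strong performance across different inference-time compute budgets on videos.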

Abstract (original)

The visual medium (images and videos) naturally contains a large amount of information redundancy, thereby providing a great opportunity for leveraging efficiency in processing. While Vision Transformer (ViT) based models scale effectively to large data regimes, they fail to capitalize on this inherent redundancy, leading to higher computational costs. Mixture of Experts (MoE) networks demonstrate scalability while maintaining same inference-time costs, but they come with a larger parameter footprint. We present Mixture of Nested Experts (MoNE), which utilizes a nested structure for experts, wherein individual experts fall on an increasing compute-accuracy curve. Given a compute budget, MoNE learns to dynamically choose tokens in a priority order, and thus redundant tokens are processed through cheaper nested experts. Using this framework, we achieve equivalent performance as the baseline models, while reducing inference time compute by over two-fold. We validate our approach on standard image and video datasets – ImageNet-21K, Kinetics400, and Something-Something-v2. We further highlight MoNE's adaptability by showcasing its ability to maintain strong performance across different inference-time compute budgets on videos, using only a single trained model.
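
The routing idea in the abstract (tokens chosen in priority order under a compute budget, with redundant tokens sent to cheaper nested experts) can be illustrated with a small sketch. This is not the authors' implementation: the router scores, the per-expert capacity fractions, the choice of a nested expert as the leading block of a shared weight matrix, and the quadratic cost model are all assumptions made for illustration, and every function name below is hypothetical.

```python
def assign_tokens(scores, capacities):
    """Assign each token to a nested expert in priority order.

    scores: one hypothetical router score per token (higher = more important).
    capacities: assumed fraction of tokens per expert, listed from the largest
                (most expensive) expert to the smallest; should sum to 1.
    Returns a list with the expert index chosen for each token.
    """
    n = len(scores)
    # Token indices sorted from highest to lowest score.
    order = sorted(range(n), key=lambda i: scores[i], reverse=True)
    # Tokens not claimed by a larger expert fall through to the cheapest one.
    assignment = [len(capacities) - 1] * n
    start = 0
    for expert, frac in enumerate(capacities[:-1]):
        count = round(frac * n)
        for i in order[start:start + count]:
            assignment[i] = expert
        start += count
    return assignment


def nested_linear(x, weight, width):
    """Apply only the leading width-by-width block of a shared weight matrix.

    Assumption: a smaller nested expert reuses a prefix of the full
    expert's parameters instead of owning separate weights.
    """
    return [sum(weight[r][c] * x[c] for c in range(width)) for r in range(width)]


def relative_compute(assignment, widths):
    """Cost relative to sending every token through the full expert.

    Assumption: the cost of one token scales with the square of the width.
    """
    full = widths[0] ** 2
    return sum(widths[e] ** 2 for e in assignment) / (full * len(assignment))


scores = [0.9, 0.1, 0.5, 0.3]      # hypothetical router outputs
capacities = [0.25, 0.25, 0.5]     # assumed budget split, largest expert first
widths = [8, 4, 2]                 # assumed nested expert widths

assignment = assign_tokens(scores, capacities)
print(assignment)                            # [0, 2, 1, 2]
print(relative_compute(assignment, widths))  # 0.34375
```

In this toy setting only the highest-scoring token uses the full expert, and the total cost is about a third of processing every token at full width. Changing `capacities` at inference time changes the budget without touching the shared weights, which is the kind of adaptability the abstract describes.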

arXiv information

Authors	Gagan Jain, Nidhi Hegde, Aditya Kusupati, Arsha Nagrani, Shyamal Buch, Prateek Jain, Anurag Arnab, Sujoy Paul
Published	2024-07-30 17:26:22+00:00

Source, services used

arxiv.jp, Google
