M$^3$ViT: Mixture-of-Experts Vision Transformer for Efficient Multi-task Learning with Model-Accelerator Co-design

Abstract

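Multi-task learning (MTL) encapsulates multiple learned tasks in a single model and often lets those tasks learn better jointly.
However, deploying MTL on real-world systems, which are often resource-constrained or latency-sensitive, raises two prominent challenges.
(i) During training, optimizing all tasks simultaneously is often difficult because of gradient conflicts across tasks. (ii) At inference, current MTL regimes have to activate nearly the entire model even to execute a single task.
Yet most real systems demand only one or two tasks at any moment and switch between tasks as needed, so this all-tasks-activated inference is highly inefficient and does not scale.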
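In this paper, we present a model-accelerator co-design framework that enables efficient on-device MTL.
Our framework, dubbed M$^3$ViT, customizes mixture-of-experts (MoE) layers into a vision transformer (ViT) backbone for MTL and sparsely activates task-specific experts during training.
At inference with any task of interest, the same design activates only the sparse expert pathway corresponding to that task instead of the full model.
The model design is further enhanced by hardware-level innovations, in particular a novel computation reordering scheme tailored for memory-constrained MTL that achieves zero-overhead switching between tasks and can scale to any number of experts.
When executing single-task inference, M$^{3}$ViT achieves higher accuracy than encoder-focused MTL methods while reducing inference FLOPs by 88%.
When implemented on a hardware platform of one Xilinx ZCU104 FPGA, our co-design framework reduces the memory requirement by 2.4 times and achieves energy efficiency up to 9.23 times higher than a comparable FPGA baseline.
Code is available at https://github.com/VITA-Group/M3ViT.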
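
The idea of activating only a task-specific sparse expert pathway can be illustrated with a toy sketch. This is not the authors' implementation: the class name `TaskMoELayer`, the one-router-per-task design, the linear experts, and the top-k gating with softmax over the selected logits are all assumptions made for illustration, written in plain Python so that it runs without any dependencies.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(weights, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]


class TaskMoELayer:
    """Toy MoE layer with one router per task (hypothetical design)."""

    def __init__(self, dim, num_experts, num_tasks, top_k=2, seed=0):
        rng = random.Random(seed)

        def rand_matrix(rows, cols):
            return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]

        # Each expert is a single dim x dim linear map in this sketch.
        self.experts = [rand_matrix(dim, dim) for _ in range(num_experts)]
        # Assumption: each task owns a router that scores every expert.
        self.routers = [rand_matrix(num_experts, dim) for _ in range(num_tasks)]
        self.top_k = top_k

    def forward(self, token, task_id):
        # Score all experts with the router of the requested task.
        logits = matvec(self.routers[task_id], token)
        # Keep only the top-k experts; the others are never evaluated.
        order = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
        selected = order[: self.top_k]
        gates = softmax([logits[i] for i in selected])
        out = [0.0] * len(token)
        for gate, idx in zip(gates, selected):
            expert_out = matvec(self.experts[idx], token)
            out = [o + gate * y for o, y in zip(out, expert_out)]
        return out, selected


layer = TaskMoELayer(dim=4, num_experts=8, num_tasks=2, top_k=2)
token = [1.0, -0.5, 0.25, 2.0]
out, used = layer.forward(token, task_id=0)
print("experts evaluated for task 0:", used)
```

In this sketch only `top_k` of the 8 experts are evaluated per token, so the expert compute for a single task is a fraction of the dense cost. The 88% FLOP reduction reported above refers to the authors' full model and is not reproduced by this toy example.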

Abstract (original)

Multi-task learning (MTL) encapsulates multiple learned tasks in a single model and often lets those tasks learn better jointly. However, when deploying MTL onto those real-world systems that are often resource-constrained or latency-sensitive, two prominent challenges arise: (i) during training, simultaneously optimizing all tasks is often difficult due to gradient conflicts across tasks; (ii) at inference, current MTL regimes have to activate nearly the entire model even to just execute a single task. Yet most real systems demand only one or two tasks at each moment, and switch between tasks as needed: therefore such all tasks activated inference is also highly inefficient and non-scalable. In this paper, we present a model-accelerator co-design framework to enable efficient on-device MTL. Our framework, dubbed M$^3$ViT, customizes mixture-of-experts (MoE) layers into a vision transformer (ViT) backbone for MTL, and sparsely activates task-specific experts during training. Then at inference with any task of interest, the same design allows for activating only the task-corresponding sparse expert pathway, instead of the full model. Our new model design is further enhanced by hardware-level innovations, in particular, a novel computation reordering scheme tailored for memory-constrained MTL that achieves zero-overhead switching between tasks and can scale to any number of experts. When executing single-task inference, M$^{3}$ViT achieves higher accuracies than encoder-focused MTL methods, while significantly reducing 88% inference FLOPs. When implemented on a hardware platform of one Xilinx ZCU104 FPGA, our co-design framework reduces the memory requirement by 2.4 times, while achieving energy efficiency up to 9.23 times higher than a comparable FPGA baseline. Code is available at: https://github.com/VITA-Group/M3ViT.

arXiv information

Authors	Hanxue Liang, Zhiwen Fan, Rishov Sarkar, Ziyu Jiang, Tianlong Chen, Kai Zou, Yu Cheng, Cong Hao, Zhangyang Wang
Published	2022-10-26 15:40:24+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
