CuMo: Scaling Multimodal LLM with Co-Upcycled Mixture-of-Experts

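Summary

Recent advances in multimodal large language models (LLMs) have focused mainly on scaling, by increasing the amount of text-image pair data and by strengthening the LLM, to improve performance on multimodal tasks.
However, these scaling approaches are computationally expensive and overlook the importance of improving model capability on the vision side.
Inspired by the successful application of Mixture-of-Experts (MoE) in LLMs, which improves model scalability during training while keeping inference cost similar to that of smaller models, we propose CuMo.
CuMo incorporates Co-upcycled Top-K sparsely-gated Mixture-of-Experts blocks into both the vision encoder and the MLP connector, strengthening the multimodal LLM while adding only minimal activated parameters at inference.
CuMo first pre-trains the MLP blocks and then, in the visual instruction tuning stage, initializes each expert in the MoE block from the pre-trained MLP block.
Auxiliary losses are used to keep the load across experts balanced.
Trained exclusively on open-source datasets, CuMo outperforms state-of-the-art multimodal LLMs within each model size group across a range of VQA and visual-instruction-following benchmarks.
The code and model weights for CuMo are open-sourced at https://github.com/SHI-Labs/CuMo.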
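
The two ideas at the core of the summary, initializing every expert from one pre-trained MLP ("upcycling") and routing each input to its Top-K experts, can be illustrated with a small pure-Python sketch. This is not the CuMo implementation: the helper names (`make_mlp`, `upcycle`, `moe_forward`), the ReLU activation, the random router initialization, and the choice to apply softmax over only the selected Top-K logits are all assumptions made for illustration.

```python
import copy
import math
import random


def make_mlp(dim, hidden, rng):
    """Hypothetical tiny two-layer MLP stored as weight matrices (lists of rows)."""
    return {
        "w1": [[rng.gauss(0, 0.1) for _ in range(dim)] for _ in range(hidden)],
        "w2": [[rng.gauss(0, 0.1) for _ in range(hidden)] for _ in range(dim)],
    }


def matvec(w, x):
    """Multiply matrix w (list of rows) by vector x."""
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]


def mlp_forward(mlp, x):
    hidden = [max(0.0, v) for v in matvec(mlp["w1"], x)]  # ReLU (assumed)
    return matvec(mlp["w2"], hidden)


def upcycle(mlp, num_experts, dim, rng):
    """Build an MoE block whose experts all start as copies of one MLP."""
    return {
        "experts": [copy.deepcopy(mlp) for _ in range(num_experts)],
        # The router is new, so it is randomly initialized (assumed scale).
        "router": [[rng.gauss(0, 0.02) for _ in range(dim)]
                   for _ in range(num_experts)],
    }


def softmax(xs):
    m = max(xs)
    exps = [math.exp(v - m) for v in xs]
    total = sum(exps)
    return [e / total for e in exps]


def moe_forward(moe, x, k=2):
    """Route x to its Top-K experts and mix their outputs by gate weight."""
    logits = matvec(moe["router"], x)
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    gates = softmax([logits[i] for i in top])  # renormalized over the Top-K
    out = [0.0] * len(x)
    for gate, idx in zip(gates, top):
        y = mlp_forward(moe["experts"][idx], x)
        out = [o + gate * yi for o, yi in zip(out, y)]
    return out, top


rng = random.Random(0)
dense_mlp = make_mlp(dim=4, hidden=8, rng=rng)  # stands in for the pre-trained MLP
moe = upcycle(dense_mlp, num_experts=4, dim=4, rng=rng)
x = [0.5, -1.0, 2.0, 0.1]
moe_out, chosen = moe_forward(moe, x, k=2)
dense_out = mlp_forward(dense_mlp, x)
```

Under these assumptions, `moe_out` matches `dense_out` up to floating-point rounding immediately after upcycling, because every expert is identical and the gate weights sum to 1. Only K of the experts run per input, which is why the activated parameter count at inference stays close to that of the dense block; the experts only diverge once training continues.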

Abstract (original)

Recent advancements in Multimodal Large Language Models (LLMs) have focused primarily on scaling by increasing text-image pair data and enhancing LLMs to improve performance on multimodal tasks. However, these scaling approaches are computationally expensive and overlook the significance of improving model capabilities from the vision side. Inspired by the successful applications of Mixture-of-Experts (MoE) in LLMs, which improves model scalability during training while keeping inference costs similar to those of smaller models, we propose CuMo. CuMo incorporates Co-upcycled Top-K sparsely-gated Mixture-of-experts blocks into both the vision encoder and the MLP connector, thereby enhancing the multimodal LLMs with minimal additional activated parameters during inference. CuMo first pre-trains the MLP blocks and then initializes each expert in the MoE block from the pre-trained MLP block during the visual instruction tuning stage. Auxiliary losses are used to ensure a balanced loading of experts. CuMo outperforms state-of-the-art multimodal LLMs across various VQA and visual-instruction-following benchmarks using models within each model size group, all while training exclusively on open-sourced datasets. The code and model weights for CuMo are open-sourced at https://github.com/SHI-Labs/CuMo.
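
The abstract does not say which auxiliary losses are used. As one common example of a loss that encourages balanced expert loading, the sketch below implements a Switch-Transformer-style load-balancing term, `num_experts * sum_i(f_i * P_i)`, where `f_i` is the fraction of Top-K assignments sent to expert `i` and `P_i` is the mean router probability for expert `i`. The function name `load_balancing_loss` and the exact normalization are assumptions for illustration, not taken from the CuMo code.

```python
import math


def softmax(xs):
    m = max(xs)
    exps = [math.exp(v - m) for v in xs]
    total = sum(exps)
    return [e / total for e in exps]


def load_balancing_loss(router_logits, k=2):
    """Switch-Transformer-style balance term (illustrative, assumed form).

    router_logits: one list of per-expert logits for each token.
    """
    num_tokens = len(router_logits)
    num_experts = len(router_logits[0])
    counts = [0] * num_experts       # how often each expert is selected
    prob_sums = [0.0] * num_experts  # accumulated router probabilities
    for logits in router_logits:
        probs = softmax(logits)
        top = sorted(range(num_experts), key=lambda i: logits[i],
                     reverse=True)[:k]
        for i in top:
            counts[i] += 1
        for i, p in enumerate(probs):
            prob_sums[i] += p
    f = [c / (num_tokens * k) for c in counts]    # fraction of assignments
    p_mean = [s / num_tokens for s in prob_sums]  # mean router probability
    return num_experts * sum(fi * pi for fi, pi in zip(f, p_mean))


# Four tokens, four experts: each token prefers a different expert.
balanced = [[2.0 if j == i else 0.0 for j in range(4)] for i in range(4)]
# Four tokens that all prefer expert 0.
collapsed = [[2.0, 0.0, 0.0, 0.0] for _ in range(4)]

balanced_loss = load_balancing_loss(balanced, k=1)
collapsed_loss = load_balancing_loss(collapsed, k=1)
```

In this toy setup the balanced routing gives a value of 1 (up to rounding), and the collapsed routing gives a larger value, so adding the term to the training objective penalizes routers that send most tokens to a few experts.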

arXiv information

Authors	Jiachen Li, Xinyao Wang, Sijie Zhu, Chia-Wen Kuo, Lu Xu, Fan Chen, Jitesh Jain, Humphrey Shi, Longyin Wen
Published	2024-05-09 17:37:20+00:00

Source and services used

arxiv.jp, Google
