p-MoD: Building Mixture-of-Depths MLLMs via Progressive Ratio Decay

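Abstract

Despite the strong performance of multimodal large language models (MLLMs) across diverse tasks, their substantial training and inference costs impede further progress.
Most of the computation comes from the very large number of vision tokens processed by the transformer decoder.
This paper proposes building efficient MLLMs with the Mixture-of-Depths (MoD) mechanism, in which each transformer decoder layer selects the essential vision tokens to process and skips the redundant ones.
However, integrating MoD into MLLMs is non-trivial.
To address the challenges of training and inference stability, as well as limited training data, the authors adapt the MoD module with two novel designs: tanh-gated weight normalization (TanhNorm) and symmetric token reweighting (STRing).
They also observe that vision tokens exhibit higher redundancy in deeper layers, and therefore design a progressive ratio decay (PRD) strategy that gradually reduces the token retention ratio layer by layer, following a shifted cosine schedule.
This design fully unlocks the potential of MoD and significantly improves both the efficiency and the performance of the models.
To validate the approach, the authors run extensive experiments with two baseline models across 14 benchmarks.
The resulting model, p-MoD, matches or even surpasses the performance of the baseline models while using only 55.6% of the TFLOPs and 53.8% of the KV cache storage during inference, and 77.7% of the GPU hours during training.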
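
The abstract names the PRD schedule and the tanh-gated weights but gives no formulas, so the following is only a rough sketch under stated assumptions. It assumes the retention ratio follows a half-cosine decay from 1 to 0 over the layers plus a constant shift, and that TanhNorm scales a router score as `alpha * tanh(score)`. The function names, the `shift` and `alpha` parameters, and their default values are hypothetical and are not taken from the paper or its code.

```python
import math


def shifted_cosine_ratios(num_layers, shift=0.0):
    """Per-layer token retention ratio (assumed form, not the paper's exact formula).

    A half-cosine decays from 1 at the first layer to 0 at the last layer;
    `shift` moves the whole curve up or down, and the result is clipped to [0, 1].
    """
    ratios = []
    for layer in range(num_layers):
        # progress runs from 0.0 (first layer) to 1.0 (last layer)
        progress = layer / (num_layers - 1) if num_layers > 1 else 0.0
        base = 0.5 * (1.0 + math.cos(math.pi * progress))
        ratios.append(min(1.0, max(0.0, base + shift)))
    return ratios


def select_tokens(router_scores, ratio):
    """Return the indices of the top-scoring tokens, keeping a `ratio` fraction."""
    k = max(1, round(len(router_scores) * ratio))
    ranked = sorted(range(len(router_scores)),
                    key=lambda i: router_scores[i], reverse=True)
    # restore original token order for the kept tokens
    return sorted(ranked[:k])


def tanh_gate(router_scores, alpha=0.2):
    """Assumed TanhNorm-style weights: zero-centered and bounded by alpha."""
    return [alpha * math.tanh(s) for s in router_scores]


if __name__ == "__main__":
    ratios = shifted_cosine_ratios(num_layers=32, shift=0.1)
    for layer in (0, 8, 16, 24, 31):
        print(f"layer {layer:2d}: keep {ratios[layer]:.3f} of vision tokens")
    scores = [0.1, 2.0, -1.0, 0.5]
    print("kept token indices:", select_tokens(scores, ratio=0.5))
    print("gated weights:", tanh_gate(scores))
```

In this sketch the early layers keep nearly all vision tokens and the deepest layers keep only a small fraction, which mirrors the observation that redundancy grows with depth. The bounded, zero-centered weights illustrate why a tanh gate could help stability, though the actual module may differ.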

Abstract (original)

Despite the remarkable performance of multimodal large language models (MLLMs) across diverse tasks, the substantial training and inference costs impede their advancement. The majority of computation stems from the overwhelming volume of vision tokens processed by the transformer decoder. In this paper, we propose to build efficient MLLMs by leveraging the Mixture-of-Depths (MoD) mechanism, where each transformer decoder layer selects essential vision tokens to process while skipping redundant ones. However, integrating MoD into MLLMs is non-trivial. To address the challenges of training and inference stability as well as limited training data, we adapt the MoD module with two novel designs: tanh-gated weight normalization (TanhNorm) and symmetric token reweighting (STRing). Moreover, we observe that vision tokens exhibit higher redundancy in deeper layer and thus design a progressive ratio decay (PRD) strategy, which gradually reduces the token retention ratio layer by layer, employing a shifted cosine schedule. This crucial design fully unleashes the potential of MoD, significantly boosting the efficiency and performance of our models. To validate the effectiveness of our approach, we conduct extensive experiments with two baseline models across 14 benchmarks. Our model, p-MoD, matches or even surpasses the performance of the baseline models, with only 55.6% TFLOPs and 53.8% KV cache storage during inference, and 77.7% GPU hours during training.

arXiv information

Authors	Jun Zhang, Desen Meng, Ji Qi, Zhenpeng Huang, Tao Wu, Limin Wang
Published	2024-12-05 18:58:03+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
