Accelerating Multimodal Large Language Models via Dynamic Visual-Token Exit and the Empirical Findings

Summary

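Existing Multimodal Large Language Models (MLLMs) use an excessive number of visual tokens, which often exhibit obvious redundancy and incur prohibitively expensive computation.
To gain insight into this problem, we first conduct extensive empirical studies of the attention behaviors of MLLMs and summarize three main inference stages: (i) early fusion between tokens is accomplished quickly;
(ii) intra-modality modeling then comes into play;
(iii) multimodal reasoning resumes and lasts until the end of inference.
In particular, we show that visual tokens stop contributing to reasoning once the text tokens have received enough image information, which yields obvious visual redundancy.
Based on these generalized observations, we propose a simple yet effective method to improve the efficiency of MLLMs, termed dynamic visual-token exit (DyVTE).
DyVTE uses lightweight hyper-networks to perceive the status of the text tokens and to decide whether to remove all visual tokens after a certain layer, thereby addressing the observed visual redundancy.
To validate DyVTE, we apply it to a set of MLLMs, including LLaVA, VILA, Eagle and InternVL, and conduct extensive experiments on a range of benchmarks.
The experimental results not only show the effectiveness of DyVTE in improving the efficiency of MLLMs, but also reveal general modeling patterns of MLLMs, facilitating a deeper understanding of these models.
Our code is anonymously released at https://github.com/DoubtedSteam/DyVTE.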
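
The exit mechanism described in the summary can be illustrated with a small toy sketch. This is not the authors' implementation: `mix_layer`, `toy_gate`, `run_with_visual_exit`, the threshold value, and the one-dimensional token states are all hypothetical stand-ins chosen for illustration. In particular, the summary describes a lightweight learned hyper-network, whereas the gate here is a hand-written function of the text tokens. The sketch only shows the control flow: after each layer, a gate looks at the text tokens, and once it fires, all visual tokens are dropped for the remaining layers.

```python
from typing import Callable, List, Sequence, Tuple

Token = List[float]  # toy hidden state of one token
Layer = Callable[[List[Token]], List[Token]]


def mix_layer(tokens: List[Token]) -> List[Token]:
    """Toy stand-in for a transformer layer: move each token halfway to the mean."""
    dim = len(tokens[0])
    mean = [sum(tok[d] for tok in tokens) / len(tokens) for d in range(dim)]
    return [[0.5 * x + 0.5 * m for x, m in zip(tok, mean)] for tok in tokens]


def toy_gate(text: List[Token]) -> float:
    """Hypothetical exit score computed from the text tokens only.

    Assumption: a hand-written score, not a learned hyper-network.
    """
    return sum(tok[0] for tok in text) / len(text)


def run_with_visual_exit(
    visual: List[Token],
    text: List[Token],
    layers: Sequence[Layer],
    gate: Callable[[List[Token]], float],
    threshold: float,
) -> Tuple[List[Token], int, List[int]]:
    """Run the layers in order and drop all visual tokens once the gate fires.

    Returns the final text states, the index of the layer after which the
    visual tokens were removed (-1 if never), and the number of tokens
    each layer processed.
    """
    exit_layer = -1
    tokens_per_layer: List[int] = []
    for i, layer in enumerate(layers):
        n_vis = len(visual)
        tokens_per_layer.append(n_vis + len(text))
        # Assumed layout: visual tokens first, text tokens last.
        hidden = layer(visual + text)
        visual, text = hidden[:n_vis], hidden[n_vis:]
        if visual and gate(text) > threshold:
            visual = []  # all visual tokens exit together
            exit_layer = i
    return text, exit_layer, tokens_per_layer


# Toy run: two "visual" tokens carrying 1.0, one "text" token starting at 0.0.
final_text, exit_layer, tokens_per_layer = run_with_visual_exit(
    visual=[[1.0], [1.0]],
    text=[[0.0]],
    layers=[mix_layer] * 4,
    gate=toy_gate,
    threshold=0.45,
)
print(exit_layer, tokens_per_layer)
```

In this toy run the gate fires after the second layer, so the last two layers process one token instead of three. That reduction in tokens per layer is the kind of saving the summary attributes to removing visual tokens once the text tokens have absorbed enough image information.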

Summary (original)

The excessive use of visual tokens in existing Multimodal Large Language Models (MLLMs) often exhibits obvious redundancy and brings in prohibitively expensive computation. To gain insights into this problem, we first conduct extensive empirical studies on the attention behaviors of MLLMs, and summarize three main inference stages in MLLMs: (i) Early fusion between tokens is first accomplished quickly. (ii) Intra-modality modeling then comes into play. (iii) Multimodal reasoning resumes and lasts until the end of inference. In particular, we reveal that visual tokens will stop contributing to reasoning when the text tokens receive enough image information, yielding obvious visual redundancy. Based on these generalized observations, we propose a simple yet effective method to improve the efficiency of MLLMs, termed dynamic visual-token exit (DyVTE). DyVTE uses lightweight hyper-networks to perceive the text token status and decide the removal of all visual tokens after a certain layer, thereby addressing the observed visual redundancy. To validate VTE, we apply it to a set of MLLMs, including LLaVA, VILA, Eagle and InternVL, and conduct extensive experiments on a bunch of benchmarks. The experiment results not only show the effectiveness of our VTE in improving MLLMs’ efficiency, but also yield the general modeling patterns of MLLMs, well facilitating the in-depth understanding of MLLMs. Our code is anonymously released at https://github.com/DoubtedSteam/DyVTE.

arXiv information

Authors	Qiong Wu, Wenhao Lin, Weihao Ye, Yiyi Zhou, Xiaoshuai Sun, Rongrong Ji
Published	2024-11-29 11:24:23+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
