Boosting Multimodal Large Language Models with Visual Tokens Withdrawal for Rapid Inference

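Summary

Multimodal large language models (MLLMs) require substantial computation at inference time because of their large parameter counts and the additional input tokens needed to represent visual information.
We introduce Visual Tokens Withdrawal (VTW), a plug-and-play module that speeds up MLLM inference.
The approach is motivated by two phenomena we observed. (1) The attention sink phenomenon that is prevalent in LLMs persists in MLLMs: the initial tokens and the nearest tokens receive most of the attention, while the vision tokens in the middle receive minimal attention in deep layers.
(2) Information migration occurs: visual information is transferred to the subsequent text tokens within the first few layers of an MLLM.
From these findings, we conclude that vision tokens are not needed in the deep layers of MLLMs.
We therefore withdraw them at a certain layer, so that only text tokens take part in the subsequent layers.
To identify the ideal layer for withdrawing vision tokens, we first analyze a limited set of tiny datasets and select the first layer that meets the Kullback-Leibler divergence criterion.
VTW can cut computational overhead by over 40% across diverse multimodal tasks while maintaining performance.
Our code is released at https://github.com/lzhxmu/VTW.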
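
The withdrawal step described in the summary can be illustrated with a minimal sketch. This is not the authors' implementation (see the repository linked in the summary for that); it is a toy under stated assumptions. Tokens are plain lists of floats, a "layer" is any function from a token list to a token list, and the names `forward_with_withdrawal`, `mixing_layer`, and `withdraw_at` are hypothetical. The toy `mixing_layer` only stands in for attention so that early layers can move information between tokens.

```python
from typing import Callable, List, Sequence

Token = List[float]
Layer = Callable[[List[Token]], List[Token]]


def mixing_layer(hidden: List[Token]) -> List[Token]:
    """Toy stand-in for a transformer layer (an assumption, not real attention).

    Each token becomes the average of itself and the mean of all tokens,
    so information can flow from vision tokens into text tokens.
    """
    dim = len(hidden[0])
    mean = [sum(h[d] for h in hidden) / len(hidden) for d in range(dim)]
    return [[0.5 * h[d] + 0.5 * mean[d] for d in range(dim)] for h in hidden]


def forward_with_withdrawal(
    tokens: List[Token],
    is_vision: Sequence[bool],
    layers: Sequence[Layer],
    withdraw_at: int,
) -> List[Token]:
    """Run `layers` in order, dropping vision tokens just before layer `withdraw_at`.

    Layers with index < withdraw_at see every token; layers with
    index >= withdraw_at see text tokens only. If withdraw_at is not a
    valid layer index, no token is ever withdrawn.
    """
    if len(tokens) != len(is_vision):
        raise ValueError("tokens and is_vision must have the same length")
    hidden = list(tokens)
    for index, layer in enumerate(layers):
        if index == withdraw_at:
            # Keep only the positions that are not vision tokens.
            hidden = [h for h, vis in zip(hidden, is_vision) if not vis]
        hidden = layer(hidden)
    return hidden
```

For example, with four tokens of which the middle two are vision tokens, `forward_with_withdrawal(tokens, [False, True, True, False], [mixing_layer] * 3, withdraw_at=1)` returns two hidden states: the first layer processes all four tokens and the remaining two layers process only the text tokens, which is where the savings come from.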

Abstract (original)

Multimodal large language models (MLLMs) demand considerable computations for inference due to the extensive parameters and the additional input tokens needed for visual information representation. Herein, we introduce Visual Tokens Withdrawal (VTW), a plug-and-play module to boost MLLMs for rapid inference. Our approach is inspired by two intriguing phenomena we have observed: (1) the attention sink phenomenon that is prevalent in LLMs also persists in MLLMs, suggesting that initial tokens and nearest tokens receive the majority of attention, while middle vision tokens garner minimal attention in deep layers; (2) the presence of information migration, which implies that visual information is transferred to subsequent text tokens within the first few layers of MLLMs. As per our findings, we conclude that vision tokens are not necessary in the deep layers of MLLMs. Thus, we strategically withdraw them at a certain layer, enabling only text tokens to engage in subsequent layers. To pinpoint the ideal layer for vision tokens withdrawal, we initially analyze a limited set of tiny datasets and choose the first layer that meets the Kullback-Leibler divergence criterion. Our VTW approach can cut computational overhead by over 40% across diverse multimodal tasks while maintaining performance. Our code is released at https://github.com/lzhxmu/VTW.
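
The abstract does not spell out the Kullback-Leibler divergence criterion, so the following is only a sketch of one plausible reading. It assumes that, for each candidate layer, the output distribution obtained when vision tokens are withdrawn at that layer is compared with the output distribution of the unmodified model, and that the first layer whose divergence falls below a threshold is chosen. The function names, the `threshold` parameter, and the direction of the divergence are all assumptions made for illustration.

```python
import math
from typing import Optional, Sequence


def kl_divergence(p: Sequence[float], q: Sequence[float], eps: float = 1e-12) -> float:
    """KL(p || q) for two discrete distributions over the same support.

    `eps` guards against log(0) when q has a zero entry.
    """
    if len(p) != len(q):
        raise ValueError("p and q must have the same length")
    return sum(
        pi * math.log((pi + eps) / (qi + eps))
        for pi, qi in zip(p, q)
        if pi > 0.0
    )


def first_layer_meeting_criterion(
    reference: Sequence[float],
    per_layer_outputs: Sequence[Sequence[float]],
    threshold: float,
) -> Optional[int]:
    """Return the first layer index whose output stays close to `reference`.

    `reference` is the output distribution of the full model (assumed).
    `per_layer_outputs[k]` is the output distribution when vision tokens
    are withdrawn at layer k (assumed). Returns None if no layer qualifies.
    """
    for layer_index, candidate in enumerate(per_layer_outputs):
        if kl_divergence(reference, candidate) < threshold:
            return layer_index
    return None
```

In practice the distributions would be averaged over the small calibration datasets the abstract mentions; the sketch takes a single distribution per layer to keep the selection rule itself visible.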

arXiv information

Authors	Zhihang Lin, Mingbao Lin, Luxi Lin, Rongrong Ji
Published	2024-05-09 14:38:53+00:00
arXiv site	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google
