Treat Visual Tokens as Text? But Your MLLM Only Needs Fewer Efforts to See

Summary

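Multimodal Large Language Models (MLLMs) treat the visual tokens produced by a visual encoder as text tokens, which lets them reuse the robust architecture of Large Language Models (LLMs) and has led to remarkable progress across a wide range of visual understanding tasks.
However, as the number of tokens grows, the quadratic scaling of computation in LLMs creates a significant efficiency bottleneck that impedes further scalability.
Recent approaches have explored pruning visual tokens or adopting lighter LLM architectures, but the computational overhead caused by a growing number of visual tokens remains a substantial challenge.
This study investigates redundancy in visual computation within LLaVA, a representative MLLM, at both the parameter level and the computational-pattern level, and introduces a suite of streamlined strategies to improve efficiency.
These include neighbor-aware visual token attention, pruning of inactive visual attention heads, and selective layer dropping for visual computation.
Implementing these strategies in LLaVA achieves an 88% reduction in computational demands while maintaining model performance across key benchmarks.
In addition, the study validates the existence of visual computational redundancy in other MLLMs, such as Qwen2-VL-7B and InternVL-2.0-4B/8B/26B.
These results present a new pathway for MLLMs to process dense visual tokens at minimal computational cost.
Code and model checkpoints will be released to support further research.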

Summary (original)

By treating visual tokens from visual encoders as text tokens, Multimodal Large Language Models (MLLMs) have achieved remarkable progress across diverse visual understanding tasks, leveraging the robust architectures of Large Language Models (LLMs). However, as token counts grow, the quadratic scaling of computation in LLMs introduces a significant efficiency bottleneck, impeding further scalability. Although recent approaches have explored pruning visual tokens or employing lighter LLM architectures, the computational overhead from an increasing number of visual tokens remains a substantial challenge. In this study, we investigate the redundancy in visual computation at both the parameter and computational pattern levels within LLaVA, a representative MLLM, and introduce a suite of streamlined strategies to enhance efficiency. These include neighbor-aware visual token attention, pruning of inactive visual attention heads, and selective layer dropping for visual computations. By implementing these strategies in LLaVA, we achieve a reduction in computational demands of 88% while maintaining model performance across key benchmarks. Additionally, we validate the existence of visual computational redundancy in other MLLMs, such as Qwen2-VL-7B and InternVL-2.0-4B/8B/26B. These results present a novel pathway for MLLMs to handle dense visual tokens with minimal computational costs. Code and model checkpoints will be released to support further research.
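
To make the idea of neighbor-aware visual token attention more concrete, the following is a minimal sketch, not the authors' implementation. It assumes that visual tokens lie row-major on a 2D patch grid and that each token attends only to tokens within a fixed window around it; the abstract does not define the exact rule, so the window shape, the `radius` parameter, and the function names `neighbor_mask` and `attended_fraction` are all hypothetical. The sketch only counts how many token pairs survive the mask compared with full quadratic attention.

```python
def neighbor_mask(grid_h, grid_w, radius):
    """Build mask[i][j]: True if visual token i may attend to token j.

    Assumption (illustrative only): tokens are laid out row-major on a
    grid_h x grid_w patch grid, and i attends to j only when both the row
    offset and the column offset are at most `radius`.
    """
    n = grid_h * grid_w
    mask = [[False] * n for _ in range(n)]
    for i in range(n):
        row_i, col_i = divmod(i, grid_w)
        for j in range(n):
            row_j, col_j = divmod(j, grid_w)
            mask[i][j] = (abs(row_i - row_j) <= radius
                          and abs(col_i - col_j) <= radius)
    return mask


def attended_fraction(mask):
    """Fraction of the n*n token pairs that the mask keeps."""
    n = len(mask)
    kept = sum(sum(row) for row in mask)
    return kept / (n * n)


if __name__ == "__main__":
    # Small hypothetical grid; a larger grid makes the saving more visible.
    mask = neighbor_mask(grid_h=8, grid_w=8, radius=1)
    print(f"kept fraction of attention pairs: {attended_fraction(mask):.3f}")
```

Full attention over n visual tokens touches n * n pairs, whereas a fixed window keeps a number of pairs that grows roughly linearly in n, which is why a local pattern of this kind can reduce visual computation. The fraction printed here is a property of this toy mask only and is unrelated to the 88% figure reported in the abstract.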

arXiv information

Authors	Zeliang Zhang, Phu Pham, Wentian Zhao, Kun Wan, Yu-Jhe Li, Jianing Zhou, Daniel Miranda, Ajinkya Kale, Chenliang Xu
Published	2024-11-15 18:43:23+00:00
arXiv site	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google
