Learning Free Token Reduction for Multi-Modal Large Language Models

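Summary

Vision-language models (VLMs) have achieved remarkable success across a wide range of multimodal tasks.
However, their practical deployment is often constrained by high computational cost and long inference times.
Because the vision modality typically carries more information than the text modality, compressing visual prompts is a promising way to ease these constraints.
Existing approaches mainly focus on refining the model architecture or directly reducing the number of visual tokens.
However, these methods often degrade inference performance because they do not account for the unique spatial and temporal characteristics of visual data.
This work proposes a token compression paradigm that operates along both the spatial and temporal dimensions.
The approach includes a learning-free, plug-and-play compression pipeline that can be integrated seamlessly into most multimodal large language model (MLLM) frameworks.
Using this method improves the model's inference capability while reducing computational cost.
Experimental results on the Video-QA task demonstrate the effectiveness of the proposed approach, showing substantial efficiency gains without sacrificing performance.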

Abstract (original)

Vision-Language Models (VLMs) have achieved remarkable success across a range of multimodal tasks; however, their practical deployment is often constrained by high computational costs and prolonged inference times. Since the vision modality typically carries more information than the text modality, compressing visual prompts offers a promising solution to alleviate these challenges. Existing approaches predominantly focus on refining model architectures or directly reducing the number of visual tokens. However, these methods often compromise inference performance due to a lack of consideration for the unique spatial and temporal characteristics of visual data. In this work, we propose a token compression paradigm that operates on both spatial and temporal dimensions. Our approach includes a learning-free, plug-and-play compression pipeline that can be seamlessly integrated into most Multimodal Large Language Model (MLLM) frameworks. By leveraging this method, we enhance the model inference capability while simultaneously reducing its computational cost. Experimental results on the Video-QA task demonstrate the effectiveness of the proposed approach, showcasing significant improvements in efficiency without sacrificing performance.
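
The abstract does not describe the compression pipeline itself, so the following is only a minimal sketch of what a learning-free reduction over both the temporal and spatial dimensions could look like. It is not the authors' method. It assumes each video frame is a row-major square grid of token vectors, drops frames whose mean token is nearly identical to the last kept frame, and then average-pools neighbouring tokens. All function names, the similarity threshold, and the pooling window are hypothetical choices made for illustration.

```python
import math

def cosine(a, b):
    """Cosine similarity of two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def mean_vector(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def temporal_prune(frames, threshold=0.95):
    """Keep a frame only if its mean token differs enough from the last kept frame.
    (Assumed heuristic, not taken from the paper.)"""
    kept, last = [], None
    for frame in frames:
        summary = mean_vector(frame)
        if last is None or cosine(summary, last) < threshold:
            kept.append(frame)
            last = summary
    return kept

def spatial_pool(frame, grid, window=2):
    """Average-pool a row-major grid x grid token map over window x window blocks."""
    pooled = []
    for r in range(0, grid, window):
        for c in range(0, grid, window):
            block = [frame[i * grid + j]
                     for i in range(r, min(r + window, grid))
                     for j in range(c, min(c + window, grid))]
            pooled.append(mean_vector(block))
    return pooled

def compress(frames, grid, threshold=0.95, window=2):
    """Temporal pruning followed by spatial pooling; no trained parameters involved."""
    return [spatial_pool(f, grid, window) for f in temporal_prune(frames, threshold)]

# Toy input: three frames, each a 4x4 grid of 2-dimensional tokens.
frame_a = [[1.0, 0.0]] * 16
frame_b = [[1.0, 0.01]] * 16   # near-duplicate of frame_a
frame_c = [[0.0, 1.0]] * 16    # clearly different content
out = compress([frame_a, frame_b, frame_c], grid=4)
print(len(out), len(out[0]))   # 2 frames kept, 4 tokens per frame
```

In this toy case 48 input tokens shrink to 8. Because nothing is trained, a step like this could in principle sit between a vision encoder and the language model without changing either, which is the sense of "plug-and-play" assumed here.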

arXiv information

Authors	Zihui Zhao, Yingxin Li, Yang Li
Published	2025-04-14 17:34:19+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
