Silent Hazards of Token Reduction in Vision-Language Models: The Hidden Impact on Consistency

Summary

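Vision-language models (VLMs) excel at visual reasoning but often incur high computational costs.
One key reason is the redundancy of visual tokens.
Recent token reduction methods claim minimal performance loss, but our extensive experiments show that token reduction can substantially alter a model's output distribution, changing prediction patterns in ways that standard metrics such as accuracy loss do not fully capture.
Such inconsistencies are especially concerning for practical applications where system stability is critical.
To investigate this phenomenon, we analyze how token reduction affects the energy distribution of a VLM's internal representations, using a low-rank approximation via singular value decomposition (SVD).
Our results show that changes in the inverse participation ratio of the singular value spectrum are strongly correlated with the model's consistency after token reduction.
Based on these insights, we propose LoFi, a training-free visual token reduction method that uses SVD leverage scores for token pruning.
Experimental evaluations show that LoFi not only reduces computational cost with minimal performance degradation but also significantly outperforms state-of-the-art methods in output consistency.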
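
To make the inverse participation ratio (IPR) concrete, here is a minimal sketch in Python with NumPy. It assumes one common definition: the squared singular values are normalized into an energy distribution p, and IPR = sum(p_i^2). The summary does not give the paper's exact formula, so the function name `inverse_participation_ratio`, the choice of squared singular values as "energy", and the input shape are all assumptions made for illustration.

```python
import numpy as np


def inverse_participation_ratio(hidden_states: np.ndarray) -> float:
    """IPR of the normalized singular value energy spectrum.

    hidden_states: a 2-D array, assumed here to be (num_tokens, hidden_dim).
    Assumption: energy is the squared singular value, normalized to sum to 1.
    """
    # Singular values only; the singular vectors are not needed for the IPR.
    s = np.linalg.svd(hidden_states, compute_uv=False)
    energy = s ** 2
    total = energy.sum()
    if total == 0:
        raise ValueError("IPR is undefined for an all-zero matrix")
    p = energy / total
    # Ranges from 1/len(p) (energy spread evenly) to 1 (one dominant direction).
    return float(np.sum(p ** 2))


if __name__ == "__main__":
    flat = np.eye(4)                            # four equal singular values
    peaked = np.outer(np.ones(4), np.ones(3))   # rank-1 matrix
    print(inverse_participation_ratio(flat))    # close to 0.25
    print(inverse_participation_ratio(peaked))  # close to 1.0
```

Under this definition, a higher IPR means the energy is concentrated in fewer singular directions, so comparing the value before and after token reduction gives one number for how much the spectrum has shifted.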

Summary (original)

Vision language models (VLMs) have excelled in visual reasoning but often incur high computational costs. One key reason is the redundancy of visual tokens. Although recent token reduction methods claim to achieve minimal performance loss, our extensive experiments reveal that token reduction can substantially alter a model’s output distribution, leading to changes in prediction patterns that standard metrics such as accuracy loss do not fully capture. Such inconsistencies are especially concerning for practical applications where system stability is critical. To investigate this phenomenon, we analyze how token reduction influences the energy distribution of a VLM’s internal representations using a lower-rank approximation via Singular Value Decomposition (SVD). Our results show that changes in the Inverse Participation Ratio of the singular value spectrum are strongly correlated with the model’s consistency after token reduction. Based on these insights, we propose LoFi, a training-free visual token reduction method that utilizes the leverage score from SVD for token pruning. Experimental evaluations demonstrate that LoFi not only reduces computational costs with minimal performance degradation but also significantly outperforms state-of-the-art methods in terms of output consistency.
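
The abstract says that token pruning uses leverage scores from the SVD but does not describe the procedure. The sketch below shows the standard rank-k row leverage score (the squared row norm of the top-k left singular vectors) and a simple keep-the-highest-scores rule. It is not LoFi itself: the function names `leverage_scores` and `prune_tokens`, the `rank` and `keep` parameters, and the top-k selection rule are hypothetical choices made for illustration.

```python
import numpy as np


def leverage_scores(tokens: np.ndarray, rank: int) -> np.ndarray:
    """Rank-`rank` leverage score of each row of a (num_tokens, dim) matrix."""
    # Thin SVD: the columns of u are the left singular vectors.
    u, _, _ = np.linalg.svd(tokens, full_matrices=False)
    # Score of token i = squared norm of row i of the top-`rank` columns of u.
    return np.sum(u[:, :rank] ** 2, axis=1)


def prune_tokens(tokens: np.ndarray, keep: int, rank: int):
    """Keep the `keep` tokens with the highest leverage scores.

    Returns the kept rows (in their original order) and their indices.
    Assumption: plain top-k selection; the paper's actual rule may differ.
    """
    scores = leverage_scores(tokens, rank)
    kept = np.sort(np.argsort(scores)[-keep:])
    return tokens[kept], kept


if __name__ == "__main__":
    # Toy input: 5 "tokens" of dimension 4, only rows 1 and 3 carry signal.
    tokens = np.zeros((5, 4))
    tokens[1] = [3.0, 0.0, 0.0, 0.0]
    tokens[3] = [0.0, 2.0, 0.0, 0.0]
    pruned, kept = prune_tokens(tokens, keep=2, rank=2)
    print(kept)  # indices of the two informative rows: 1 and 3
```

The scores sum to `rank`, so they can be read as each token's share of the dominant low-rank subspace. No training is involved, which is consistent with the abstract's description of the method as training-free.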

arXiv information

Authors	Yizheng Sun, Hao Li, Chang Xu, Chenghua Lin, Riza Batista-Navarro, Jingyuan Sun
Published	2025-03-11 14:34:14+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
