VisionZip: Longer is Better but Not Necessary in Vision Language Models

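Abstract

Recent advances in vision-language models have improved performance by making visual token sequences much longer than text token sequences, which significantly increases computational cost.
However, the visual tokens produced by popular vision encoders such as CLIP and SigLIP turn out to contain substantial redundancy.
To address this, we introduce VisionZip, a simple yet effective method that selects a set of informative tokens as input to the language model. This reduces visual token redundancy and improves efficiency while maintaining model performance.
VisionZip applies broadly to image and video understanding tasks and suits multi-turn dialogue in real-world scenarios, where previous methods tend to underperform.
In experiments, VisionZip improves on the previous state-of-the-art method by at least 5% in nearly all settings.
Our method also significantly speeds up inference: it improves prefilling time by 8x and lets the LLaVA-Next 13B model run inference faster than the LLaVA-Next 7B model while achieving better results.
Finally, we analyze the causes of this redundancy and encourage the community to focus on extracting better visual features rather than simply increasing token length.
Our code is available at https://github.com/dvlab-research/VisionZip.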
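
The abstract says only that VisionZip "selects a set of informative tokens" and does not state the selection criterion. The sketch below is therefore an illustration of generic top-k token selection, not the authors' implementation: the function name `select_informative_tokens`, the per-token `scores` (imagined here as some importance measure supplied by the vision encoder), and the toy values are all assumptions made for this example.

```python
from typing import List, Sequence


def select_informative_tokens(
    tokens: Sequence[Sequence[float]],
    scores: Sequence[float],
    keep: int,
) -> List[int]:
    """Return indices of the `keep` highest-scoring tokens, in original order.

    Hypothetical helper: the scoring rule is an assumption, not taken
    from the VisionZip abstract.
    """
    if len(tokens) != len(scores):
        raise ValueError("tokens and scores must have the same length")
    if keep < 0:
        raise ValueError("keep must be non-negative")
    keep = min(keep, len(tokens))
    # Rank token indices by score, highest first; ties keep the earlier token
    # because Python's sort is stable.
    ranked = sorted(range(len(scores)), key=lambda i: -scores[i])
    # Restore positional order so the retained tokens stay in sequence.
    return sorted(ranked[:keep])


# Toy example: six 2-dimensional "visual tokens" with made-up scores.
tokens = [[0.1, 0.2], [0.9, 0.4], [0.3, 0.3], [0.8, 0.7], [0.0, 0.1], [0.5, 0.6]]
scores = [0.05, 0.40, 0.10, 0.30, 0.02, 0.13]

kept = select_informative_tokens(tokens, scores, keep=3)
reduced = [tokens[i] for i in kept]
print(kept)  # [1, 3, 5]
```

Under these assumptions the language model would receive 3 tokens instead of 6. Since prefill cost grows with input length, shortening the visual sequence in this way is the kind of reduction the abstract credits for the faster prefilling time.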

Abstract (original)

Recent advancements in vision-language models have enhanced performance by increasing the length of visual tokens, making them much longer than text tokens and significantly raising computational costs. However, we observe that the visual tokens generated by popular vision encoders, such as CLIP and SigLIP, contain significant redundancy. To address this, we introduce VisionZip, a simple yet effective method that selects a set of informative tokens for input to the language model, reducing visual token redundancy and improving efficiency while maintaining model performance. The proposed VisionZip can be widely applied to image and video understanding tasks and is well-suited for multi-turn dialogues in real-world scenarios, where previous methods tend to underperform. Experimental results show that VisionZip outperforms the previous state-of-the-art method by at least 5% performance gains across nearly all settings. Moreover, our method significantly enhances model inference speed, improving the prefilling time by 8x and enabling the LLaVA-Next 13B model to infer faster than the LLaVA-Next 7B model while achieving better results. Furthermore, we analyze the causes of this redundancy and encourage the community to focus on extracting better visual features rather than merely increasing token length. Our code is available at https://github.com/dvlab-research/VisionZip .

arXiv information

Authors	Senqiao Yang, Yukang Chen, Zhuotao Tian, Chengyao Wang, Jingyao Li, Bei Yu, Jiaya Jia
Published	2024-12-05 18:59:53+00:00

Source, services used

arxiv.jp, Google
