PruneVid: Visual Token Pruning for Efficient Video Large Language Models

Abstract

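This paper introduces PruneVid, a visual token pruning method designed to make multi-modal video understanding more efficient.
Large language models (LLMs) have shown promising performance on video tasks thanks to their extended ability to comprehend visual modalities.
However, the substantial redundancy in video data creates significant computational challenges for LLMs.
To address this, we introduce a training-free method that 1) minimizes video redundancy by merging spatial-temporal tokens and 2) leverages the LLM's reasoning capabilities to keep the visual features relevant to the question tokens and prune the rest, improving model efficiency.
We validate the method on multiple video benchmarks and show that PruneVid can prune over 80% of tokens while maintaining competitive performance when combined with different model networks.
This highlights its superior effectiveness and efficiency compared with existing pruning methods.
Code: https://github.com/Visual-AI/PruneVid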
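
The second step above, question-guided token selection, can be illustrated with a minimal sketch. This is not the authors' implementation (see the repository linked above for that). It assumes a per-token relevance score is already available, for example an averaged question-to-visual attention weight, and the function name `select_tokens_to_keep` and the `keep_ratio` parameter are hypothetical. Pruning over 80% of tokens corresponds here to `keep_ratio=0.2`.

```python
from typing import List, Sequence


def select_tokens_to_keep(relevance: Sequence[float], keep_ratio: float = 0.2) -> List[int]:
    """Return indices of the highest-scoring visual tokens, in original order.

    `relevance` holds one (assumed, precomputed) question-to-token relevance
    score per visual token; `keep_ratio` is the fraction of tokens to retain.
    """
    if not 0.0 < keep_ratio <= 1.0:
        raise ValueError("keep_ratio must be in (0, 1]")
    if not relevance:
        return []
    # Always keep at least one token so the model still sees some visual input.
    num_keep = max(1, int(len(relevance) * keep_ratio))
    # Rank token indices by score, highest first (ties keep the earlier token).
    ranked = sorted(range(len(relevance)), key=lambda i: relevance[i], reverse=True)
    # Restore the original ordering so the kept tokens stay in sequence order.
    return sorted(ranked[:num_keep])


# Ten visual tokens with made-up relevance scores; keep the top 20%.
scores = [0.05, 0.90, 0.10, 0.02, 0.75, 0.01, 0.03, 0.04, 0.60, 0.08]
kept = select_tokens_to_keep(scores, keep_ratio=0.2)
print(kept)  # [1, 4]
```

Returning indices rather than the tokens themselves lets the same selection be applied to any per-token array, such as features or cached keys and values.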

Abstract (original)

In this paper, we introduce PruneVid, a visual token pruning method designed to enhance the efficiency of multi-modal video understanding. Large Language Models (LLMs) have shown promising performance in video tasks due to their extended capabilities in comprehending visual modalities. However, the substantial redundancy in video data presents significant computational challenges for LLMs. To address this issue, we introduce a training-free method that 1) minimizes video redundancy by merging spatial-temporal tokens, and 2) leverages LLMs’ reasoning capabilities to selectively prune visual features relevant to question tokens, enhancing model efficiency. We validate our method across multiple video benchmarks, which demonstrate that PruneVid can prune over 80% of tokens while maintaining competitive performance combined with different model networks. This highlights its superior effectiveness and efficiency compared to existing pruning methods. Code: https://github.com/Visual-AI/PruneVid.

arXiv information

Authors	Xiaohu Huang, Hao Zhou, Kai Han
Published	2024-12-20 18:01:58+00:00
arXiv site	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google
