LLaVA-PruMerge: Adaptive Token Reduction for Efficient Large Multimodal Models

Abstract

Large Multimodal Models (LMMs) have shown significant reasoning capabilities by connecting a visual encoder and a large language model. LMMs typically use a fixed amount of visual tokens, such as the penultimate layer features in the CLIP visual encoder, as the prefix content. Recent LMMs incorporate more complex visual inputs, such as high-resolution images and videos, which increase the number of visual tokens significantly. However, due to the design of the Transformer architecture, computational costs associated with these models tend to increase quadratically with the number of input tokens. To tackle this problem, we explore a token reduction mechanism and find, similar to prior work, that many visual tokens are spatially redundant. Based on this, we propose PruMerge, a novel adaptive visual token reduction approach, which largely reduces the number of visual tokens while maintaining comparable model performance. We first select the unpruned visual tokens based on their similarity to class tokens and spatial tokens. We then cluster the pruned tokens based on key similarity and merge the clustered tokens with the unpruned tokens to supplement their information. Empirically, when applied to LLaVA-1.5, our approach can compress the visual tokens by 18 times on average, and achieve comparable performance across diverse visual question-answering and reasoning tasks. Code and checkpoints are at https://llava-prumerge.github.io/.
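
The select, cluster, and merge steps described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. The function name `prune_and_merge`, the parameter `k`, the 1.5 × IQR outlier rule for choosing which tokens to keep, the cosine similarity over keys, and the class-similarity-weighted average are all assumptions made here for illustration. The abstract states only that tokens are selected by similarity to the class token, clustered by key similarity, and merged.

```python
import numpy as np


def prune_and_merge(tokens, keys, cls_attn, k=4):
    """Reduce N visual tokens to a smaller set (illustrative sketch only).

    tokens:   (N, D) spatial token features
    keys:     (N, Dk) key vectors of the same tokens (assumed non-zero)
    cls_attn: (N,) positive similarity of the class token to each spatial token
    k:        hypothetical number of pruned neighbours merged into each kept token
    """
    # Assumption: keep tokens whose class similarity is an upper outlier
    # under the 1.5 * IQR rule, so the kept count adapts to the image.
    q1, q3 = np.percentile(cls_attn, [25, 75])
    keep = cls_attn > q3 + 1.5 * (q3 - q1)
    if not keep.any():  # fall back to the single most similar token
        keep[np.argmax(cls_attn)] = True
    kept_idx = np.flatnonzero(keep)
    pruned_idx = np.flatnonzero(~keep)

    # Cosine similarity between the keys of kept and pruned tokens.
    unit = keys / np.linalg.norm(keys, axis=1, keepdims=True)
    sim = unit[kept_idx] @ unit[pruned_idx].T  # (n_kept, n_pruned)

    merged = np.empty((len(kept_idx), tokens.shape[1]))
    for row, centre in enumerate(kept_idx):
        # Cluster: the k pruned tokens whose keys are closest to this kept token.
        nearest = pruned_idx[np.argsort(sim[row])[::-1][:k]]
        members = np.concatenate(([centre], nearest))
        # Merge: average of the cluster, weighted by class similarity.
        weights = cls_attn[members] / cls_attn[members].sum()
        merged[row] = weights @ tokens[members]
    return merged, kept_idx
```

Because the number of kept tokens depends on how many class-similarity scores stand out, the output length varies per image, which is one way to read "adaptive" in the abstract. As a rough sense of scale, if the input is a 24 × 24 grid of 576 visual tokens (a figure assumed here, not stated in the abstract), an 18-fold reduction corresponds to about 32 tokens.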

arXiv information

Authors	Yuzhang Shang, Mu Cai, Bingxin Xu, Yong Jae Lee, Yan Yan
Published	2024-04-12 17:34:29+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
