Efficient Multi-modal Large Language Models via Visual Token Grouping

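Summary

The development of Multi-modal Large Language Models (MLLMs) gives Large Language Models (LLMs) the ability to perceive data formats beyond text, significantly advancing downstream applications such as visual question answering and image captioning.
However, the substantial computational cost of processing high-resolution images and videos remains a barrier to broader adoption.
To address this challenge, compressing vision tokens in MLLMs has emerged as a promising way to reduce inference cost.
Existing methods perform token reduction in the feature alignment phase.
This paper introduces VisToG, a novel grouping mechanism that leverages the capabilities of a pre-trained vision encoder to group similar image segments without requiring segmentation masks.
Specifically, semantic tokens that represent semantic segments of the image are concatenated with the patch tokens after the linear projection layer, before the sequence is fed into the vision encoder.
In addition, with the isolated attention the authors adopt, VisToG can use the prior knowledge in the pre-trained vision encoder to identify and eliminate redundant visual tokens, which effectively reduces computational demands.
Extensive experiments demonstrate the effectiveness of VisToG: it maintains 98.1% of the original performance while reducing inference time by more than 27%.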
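
The summary describes three steps: concatenating semantic tokens with the projected patch tokens, restricting attention, and merging similar patches into groups. The sketch below illustrates that flow in plain Python under loudly stated assumptions. The mask rule (patch tokens never attend to semantic tokens), the hard assignment by cosine similarity, and the mean pooling are illustrative choices, and the names `build_isolated_mask` and `group_tokens` are hypothetical. None of it is taken from the VisToG implementation.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def build_isolated_mask(num_patches, num_semantic):
    """mask[i][j] is True when token i may attend to token j.

    Assumed rule (not confirmed by the abstract): every token may read the
    patch tokens, and a semantic token may additionally read itself, so the
    patch tokens are never influenced by the semantic tokens.
    """
    total = num_patches + num_semantic
    return [[j < num_patches or i == j for j in range(total)] for i in range(total)]


def group_tokens(patch_tokens, semantic_tokens):
    """Assign each patch to its most similar semantic token, then mean-pool each group.

    Returns (grouped_tokens, assignment). Groups that receive no patch are dropped,
    which is where the token count shrinks in this toy version.
    """
    assignment = [
        max(range(len(semantic_tokens)), key=lambda k: cosine(p, semantic_tokens[k]))
        for p in patch_tokens
    ]
    grouped = []
    for k in range(len(semantic_tokens)):
        members = [p for p, a in zip(patch_tokens, assignment) if a == k]
        if members:
            dim = len(members[0])
            grouped.append([sum(m[d] for m in members) / len(members) for d in range(dim)])
    return grouped, assignment


# Toy data: four 2-D "patch" embeddings and two hypothetical semantic tokens.
patches = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
semantic = [[1.0, 0.0], [0.0, 1.0]]

sequence = patches + semantic  # concatenated input that an encoder would receive
mask = build_isolated_mask(len(patches), len(semantic))
grouped, assignment = group_tokens(patches, semantic)

print(assignment)    # [0, 0, 1, 1]
print(len(grouped))  # 2 grouped tokens instead of 4 patch tokens
```

In a real model the semantic tokens would be learned parameters and the grouping would operate on encoder outputs rather than raw inputs. The toy version only shows how four patch tokens can collapse into two grouped tokens that would then be passed to the language model.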

Summary (original)

The development of Multi-modal Large Language Models (MLLMs) enhances Large Language Models (LLMs) with the ability to perceive data formats beyond text, significantly advancing a range of downstream applications, such as visual question answering and image captioning. However, the substantial computational costs associated with processing high-resolution images and videos pose a barrier to their broader adoption. To address this challenge, compressing vision tokens in MLLMs has emerged as a promising approach to reduce inference costs. Existing methods conduct token reduction in the feature alignment phase. In this paper, we introduce VisToG, a novel grouping mechanism that leverages the capabilities of pre-trained vision encoders to group similar image segments without the need for segmentation masks. Specifically, we concatenate semantic tokens to represent image semantic segments after the linear projection layer before feeding into the vision encoder. Besides, with the isolated attention we adopt, VisToG can identify and eliminate redundant visual tokens utilizing the prior knowledge in the pre-trained vision encoder, which effectively reduces computational demands. Extensive experiments demonstrate the effectiveness of VisToG, maintaining 98.1% of the original performance while achieving a reduction of over 27% in inference time.

arXiv information

Authors	Minbin Huang, Runhui Huang, Han Shi, Yimeng Chen, Chuanyang Zheng, Xiangguo Sun, Xin Jiang, Zhenguo Li, Hong Cheng
Published	2024-12-02 14:55:49+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
