ConvLLaVA: Hierarchical Backbones as Visual Encoder for Large Multimodal Models

Summary

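High-resolution large multimodal models (LMMs) face two challenges: excessive visual tokens and quadratic visual complexity.
Current high-resolution LMMs address the quadratic complexity but still generate excessive visual tokens.
The redundancy in visual tokens, however, is the key problem, because it leads to substantially more computation.
To mitigate this issue, the authors propose ConvLLaVA, which uses ConvNeXt, a hierarchical backbone, as the visual encoder of the LMM in place of the Vision Transformer (ViT).
ConvLLaVA compresses high-resolution images into information-rich visual features, which effectively prevents the generation of excessive visual tokens.
To strengthen ConvLLaVA, the authors propose two critical optimizations.
First, because a ConvNeXt pretrained at low resolution underperforms when applied directly to high-resolution inputs, they update it to bridge the gap.
Second, because ConvNeXt's original compression ratio is inadequate for much higher-resolution inputs, they train a successive stage that compresses the visual tokens further and thereby reduces redundancy.
With these optimizations, ConvLLaVA supports 1536×1536 inputs while generating only 576 visual tokens, and it can handle images of arbitrary aspect ratios.
Experimental results show that the method achieves performance competitive with state-of-the-art models on mainstream benchmarks.
The ConvLLaVA model series is publicly available at https://github.com/alibaba/conv-llava.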
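
The token count quoted above can be checked with simple arithmetic. The sketch below is illustrative only and is not taken from the ConvLLaVA repository. It assumes the standard four-stage ConvNeXt layout with a total downsampling factor of 32, and it assumes that the additional stage halves the feature map once more, for a total factor of 64. That second assumption is inferred from the numbers in the summary, since (1536 / 64)² = 576. The function name visual_tokens is hypothetical.

```python
def visual_tokens(height: int, width: int, stride: int) -> int:
    """Number of visual tokens when each token covers a stride x stride pixel area."""
    if height % stride or width % stride:
        raise ValueError("image size must be divisible by the total stride")
    return (height // stride) * (width // stride)


# Standard ConvNeXt: 4x stem followed by three 2x downsampling layers = 32x overall.
CONVNEXT_STRIDE = 4 * 2 * 2 * 2

# Assumption: the extra trained stage adds one more 2x downsampling step (64x overall).
CONVLLAVA_STRIDE = CONVNEXT_STRIDE * 2

# Four-stage ConvNeXt at 1536x1536: 48 * 48 tokens.
tokens_four_stages = visual_tokens(1536, 1536, CONVNEXT_STRIDE)

# With the assumed fifth stage: 24 * 24 tokens, matching the 576 quoted in the summary.
tokens_five_stages = visual_tokens(1536, 1536, CONVLLAVA_STRIDE)

# For comparison, a ViT with 14-pixel patches reaches the same count at only 336x336.
tokens_vit_336 = visual_tokens(336, 336, 14)

# A non-square input under the same assumed stride: 24 * 12 tokens.
tokens_wide = visual_tokens(1536, 768, CONVLLAVA_STRIDE)

print(tokens_four_stages, tokens_five_stages, tokens_vit_336, tokens_wide)
```

Under these assumptions, the extra stage cuts the token count by a factor of four at the same input resolution, which is the kind of redundancy reduction the summary describes.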

Abstract (original)

High-resolution Large Multimodal Models (LMMs) encounter the challenges of excessive visual tokens and quadratic visual complexity. Current high-resolution LMMs address the quadratic complexity while still generating excessive visual tokens. However, the redundancy in visual tokens is the key problem as it leads to more substantial compute. To mitigate this issue, we propose ConvLLaVA, which employs ConvNeXt, a hierarchical backbone, as the visual encoder of LMM to replace Vision Transformer (ViT). ConvLLaVA compresses high-resolution images into information-rich visual features, effectively preventing the generation of excessive visual tokens. To enhance the capabilities of ConvLLaVA, we propose two critical optimizations. Since the low-resolution pretrained ConvNeXt underperforms when directly applied on high resolution, we update it to bridge the gap. Moreover, since ConvNeXt’s original compression ratio is inadequate for much higher resolution inputs, we train a successive stage to further compress the visual tokens, thereby reducing redundancy. These optimizations enable ConvLLaVA to support inputs of 1536×1536 resolution generating only 576 visual tokens, capable of handling images of arbitrary aspect ratios. Experimental results demonstrate that our method achieves competitive performance with state-of-the-art models on mainstream benchmarks. The ConvLLaVA model series are publicly available at https://github.com/alibaba/conv-llava.

arXiv information

Authors	Chunjiang Ge, Sijie Cheng, Ziming Wang, Jiale Yuan, Yuan Gao, Jun Song, Shiji Song, Gao Huang, Bo Zheng
Published	2024-05-24 17:34:15+00:00
arXiv page	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
