Visual Perception by Large Language Model’s Weights

Summary

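Existing multimodal large language models (MLLMs) perceive visual information by aligning visual features with the input space of a large language model (LLM) and concatenating visual tokens with text tokens into a single input sequence.
These methods show promising results on a range of vision-language tasks, but the visual tokens lengthen the input sequence and make computation expensive.
Instead of input space alignment, this paper proposes a parameter space alignment paradigm that represents visual information as model weights.
For each input image, a vision encoder extracts visual features, the features are converted into perceptual weights, and the perceptual weights are merged with the LLM's weights.
The LLM input therefore needs no visual tokens, which shortens the input sequence and greatly improves efficiency.
Following this paradigm, the authors propose VLoRA, which includes a perceptual weights generator.
The generator is designed to convert visual features into perceptual weights with a low-rank property, in a form similar to LoRA.
Experiments show that VLoRA achieves comparable performance on various MLLM benchmarks while significantly reducing the computational cost of both training and inference.
The code and models will be open-sourced.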
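
The idea of merging image-dependent, low-rank weights into an LLM layer can be illustrated with a toy sketch. This is not the authors' implementation: the sizes, the generator projection `G`, the image-independent factor `B`, and the function names are all hypothetical, chosen only to show a LoRA-like update of the form W + BA where A depends on the image.

```python
import numpy as np

rng = np.random.default_rng(0)
h, d_v, r = 16, 8, 2  # hidden size, visual feature size, rank (toy values)

W = rng.standard_normal((h, h))               # one frozen LLM weight matrix
G = rng.standard_normal((r * h, d_v)) * 0.1   # hypothetical generator projection
B = rng.standard_normal((h, r)) * 0.1         # hypothetical image-independent factor

def perceptual_weights(visual_feature):
    """Turn one pooled visual feature vector into a rank-r weight update."""
    A = (G @ visual_feature).reshape(r, h)    # image-dependent factor
    return B @ A                              # shape (h, h), rank at most r

def merged_forward(text_hidden, visual_feature):
    """Apply the layer with image-specific weights; no visual tokens are added."""
    W_merged = W + perceptual_weights(visual_feature)
    return text_hidden @ W_merged.T

v = rng.standard_normal(d_v)      # stand-in for a vision encoder output
x = rng.standard_normal((5, h))   # hidden states of 5 text tokens
y = merged_forward(x, v)
delta = perceptual_weights(v)
```

In this sketch the sequence stays at 5 text tokens regardless of the image, because the image only changes the weights the tokens pass through.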

Summary (original)

Existing Multimodal Large Language Models (MLLMs) follow the paradigm that perceives visual information by aligning visual features with the input space of Large Language Models (LLMs), and concatenating visual tokens with text tokens to form a unified sequence input for LLMs. These methods demonstrate promising results on various vision-language tasks but are limited by the high computational effort due to the extended input sequence resulting from the involvement of visual tokens. In this paper, instead of input space alignment, we propose a novel parameter space alignment paradigm that represents visual information as model weights. For each input image, we use a vision encoder to extract visual features, convert features into perceptual weights, and merge the perceptual weights with LLM’s weights. In this way, the input of LLM does not require visual tokens, which reduces the length of the input sequence and greatly improves efficiency. Following this paradigm, we propose VLoRA with the perceptual weights generator. The perceptual weights generator is designed to convert visual features to perceptual weights with low-rank property, exhibiting a form similar to LoRA. The experimental results show that our VLoRA achieves comparable performance on various benchmarks for MLLMs, while significantly reducing the computational costs for both training and inference. The code and models will be made open-source.
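
To get a rough feel for why removing visual tokens helps, the sketch below counts query-key pairs in full self-attention, which grow quadratically with sequence length. The token counts are assumptions for illustration and are not taken from the paper, and the estimate ignores the cost of generating the perceptual weights and of the layers whose cost grows only linearly with sequence length.

```python
def attention_pair_count(seq_len):
    """Number of query-key pairs in full self-attention over seq_len tokens."""
    return seq_len * seq_len

text_tokens = 64      # assumed prompt length
visual_tokens = 576   # assumed visual token count, not taken from the source

with_visual = attention_pair_count(text_tokens + visual_tokens)
text_only = attention_pair_count(text_tokens)
ratio = with_visual / text_only

print(with_visual, text_only, ratio)  # 409600 4096 100.0
```

Under these assumed counts, the attention term alone shrinks by a factor of 100 when the visual tokens are dropped from the input; the real end-to-end saving depends on the model and the prompt.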

arXiv information

Authors	Feipeng Ma, Hongwei Xue, Guangting Wang, Yizhou Zhou, Fengyun Rao, Shilin Yan, Yueyi Zhang, Siying Wu, Mike Zheng Shou, Xiaoyan Sun
Published	2024-05-30 17:59:47+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
