Improving Multi-modal Large Language Model through Boosting Vision Capabilities

Summary

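We focus on improving visual understanding as a way to strengthen vision-language models.
We propose \textbf{Arcana}, a multimodal language model that introduces two key techniques.
First, we present Multimodal LoRA (MM-LoRA), a module designed to enhance the decoder.
Unlike conventional language-driven decoders, MM-LoRA consists of two parallel LoRAs, one for vision and one for language, each with its own parameters.
This disentangled parameter design allows more specialized learning in each modality and better integration of multimodal information.
Second, we introduce the Query Ladder adapter (QLadder) to improve the visual encoder.
QLadder employs a learnable "\textit{ladder}" structure to deeply aggregate the intermediate representations from a frozen pretrained visual encoder (e.g., the CLIP image encoder).
This enables the model to learn new, informative visual features while retaining the strong capabilities of the pretrained visual encoder.
Together, these techniques enhance Arcana's visual perception, enabling it to use the improved visual information to produce more accurate and contextually relevant outputs across a variety of multimodal scenarios.
Extensive experiments and ablation studies demonstrate Arcana's effectiveness and generalization capability.
The code and re-annotated data are available at \url{https://arcana-project-page.github.io}.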
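
The MM-LoRA idea described above (two parallel LoRAs with separate parameters, one per modality) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. It assumes that each token is routed to the LoRA branch of its own modality through a boolean mask, and the class name `MMLoRALinear`, the ranks, and the initialization are all hypothetical choices made for illustration.

```python
import numpy as np


class MMLoRALinear:
    """Frozen linear layer plus two parallel LoRA branches (hypothetical sketch)."""

    def __init__(self, d_in, d_out, rank_vision=4, rank_text=4, seed=0):
        rng = np.random.default_rng(seed)
        # Frozen "pretrained" weight; random here, standing in for a real checkpoint.
        self.W = rng.normal(0.0, 0.02, size=(d_in, d_out))
        # Each modality owns its low-rank factors. B starts at zero, so the
        # layer initially behaves exactly like the frozen one.
        self.A_v = rng.normal(0.0, 0.02, size=(d_in, rank_vision))
        self.B_v = np.zeros((rank_vision, d_out))
        self.A_t = rng.normal(0.0, 0.02, size=(d_in, rank_text))
        self.B_t = np.zeros((rank_text, d_out))

    def __call__(self, x, is_vision):
        # x: (seq_len, d_in); is_vision: (seq_len,) boolean mask of visual tokens.
        base = x @ self.W
        delta_v = (x @ self.A_v) @ self.B_v  # vision-branch update
        delta_t = (x @ self.A_t) @ self.B_t  # language-branch update
        # Assumption: each token only receives the update of its own modality.
        delta = np.where(is_vision[:, None], delta_v, delta_t)
        return base + delta


layer = MMLoRALinear(d_in=8, d_out=8)
x = np.random.default_rng(1).normal(size=(5, 8))
is_vision = np.array([True, True, True, False, False])
print(layer(x, is_vision).shape)  # (5, 8)
```

Because the two branches share no parameters, changing `B_v` affects only the rows marked as visual tokens, which is one way to read the "disentangled" design the summary mentions.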

Summary (original)

We focus on improving the visual understanding capability for boosting the vision-language models. We propose \textbf{Arcana}, a multiModal language model, which introduces two crucial techniques. First, we present Multimodal LoRA (MM-LoRA), a module designed to enhance the decoder. Unlike traditional language-driven decoders, MM-LoRA consists of two parallel LoRAs — one for vision and one for language — each with its own parameters. This disentangled parameters design allows for more specialized learning in each modality and better integration of multimodal information. Second, we introduce the Query Ladder adapter (QLadder) to improve the visual encoder. QLadder employs a learnable “\textit{ladder}” structure to deeply aggregates the intermediate representations from the frozen pretrained visual encoder (e.g., CLIP image encoder). This enables the model to learn new and informative visual features, as well as remaining the powerful capabilities of the pretrained visual encoder. These techniques collectively enhance Arcana’s visual perception power, enabling it to leverage improved visual information for more accurate and contextually relevant outputs across various multimodal scenarios. Extensive experiments and ablation studies demonstrate the effectiveness and generalization capability of our Arcana. The code and re-annotated data are available at \url{https://arcana-project-page.github.io}.
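
The abstract does not say how QLadder aggregates the intermediate representations, so the following NumPy sketch is only one plausible reading. It assumes that a small set of learnable queries cross-attends to the features of each tapped encoder layer in turn (one "rung" per layer) and that the resulting tokens are appended to the frozen encoder's final output. The function names, the single-head attention without projections, and the concatenation step are all assumptions, not details from the paper.

```python
import numpy as np


def softmax(scores, axis=-1):
    # Numerically stable softmax.
    scores = scores - scores.max(axis=axis, keepdims=True)
    exp = np.exp(scores)
    return exp / exp.sum(axis=axis, keepdims=True)


def cross_attend(queries, features):
    # Single-head scaled dot-product attention with a residual connection.
    # Projection matrices are omitted to keep the sketch short.
    d = queries.shape[-1]
    weights = softmax(queries @ features.T / np.sqrt(d))
    return queries + weights @ features


def query_ladder(intermediate_feats, final_feats, queries):
    """Hypothetical ladder: queries climb through intermediate layer features.

    intermediate_feats: list of (num_patches, d) arrays from frozen layers.
    final_feats: (num_patches, d) output of the frozen encoder.
    queries: (num_queries, d) learnable query tokens.
    """
    q = queries
    for feats in intermediate_feats:  # one rung per tapped layer
        q = cross_attend(q, feats)
    # The frozen output is passed through unchanged; new tokens are appended.
    return np.concatenate([final_feats, q], axis=0)


rng = np.random.default_rng(0)
layers = [rng.normal(size=(16, 32)) for _ in range(3)]  # stand-in encoder features
queries = rng.normal(size=(4, 32))
tokens = query_ladder(layers[:-1], layers[-1], queries)
print(tokens.shape)  # (20, 32)
```

Leaving `final_feats` untouched mirrors the stated goal of retaining the pretrained encoder's capabilities, while only the query path would carry trainable parameters.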

arXiv information

Authors	Yanpeng Sun, Huaxin Zhang, Qiang Chen, Xinyu Zhang, Nong Sang, Gang Zhang, Jingdong Wang, Zechao Li
Published	2024-10-17 16:36:38+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
