LLaVA-KD: A Framework of Distilling Multimodal Large Language Models

Abstract

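The success of large language models (LLMs) has led researchers to explore multimodal large language models (MLLMs) for unified visual and linguistic understanding.
However, the growing model size and computational complexity of MLLMs limit their use in resource-constrained environments.
Small-scale MLLMs (s-MLLMs) aim to retain the capabilities of large-scale models (l-MLLMs) while reducing computational demands, but they suffer a significant drop in performance.
To address this, we propose LLaVA-KD, a novel framework for transferring knowledge from an l-MLLM to an s-MLLM.
Specifically, we introduce Multimodal Distillation (MDist), which minimizes the divergence between the visual-textual output distributions of the l-MLLM and the s-MLLM, and Relation Distillation (RDist), which transfers the l-MLLM's ability to model correlations between visual features.
In addition, we propose a three-stage training scheme to fully exploit the potential of the s-MLLM: 1) Distilled Pre-Training to align visual-textual representations, 2) Supervised Fine-Tuning to equip the model with multimodal understanding, and 3) Distilled Fine-Tuning to further transfer l-MLLM capabilities.
Our approach significantly improves performance without altering the small model's architecture.
Extensive experiments and ablation studies validate the effectiveness of each proposed component.
Code will be available at https://github.com/caiyuxuan1120/LLaVA-KD.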

Abstract (original)

The success of Large Language Models (LLM) has led researchers to explore Multimodal Large Language Models (MLLM) for unified visual and linguistic understanding. However, the increasing model size and computational complexity of MLLM limit their use in resource-constrained environments. Small-scale MLLM (s-MLLM) aims to retain the capabilities of the large-scale model (l-MLLM) while reducing computational demands, but this results in a significant decline in performance. To address the aforementioned issues, we propose a novel LLaVA-KD framework to transfer knowledge from l-MLLM to s-MLLM. Specifically, we introduce Multimodal Distillation (MDist) to minimize the divergence between the visual-textual output distributions of l-MLLM and s-MLLM, and Relation Distillation (RDist) to transfer l-MLLM’s ability to model correlations between visual features. Additionally, we propose a three-stage training scheme to fully exploit the potential of s-MLLM: 1) Distilled Pre-Training to align visual-textual representations, 2) Supervised Fine-Tuning to equip the model with multimodal understanding, and 3) Distilled Fine-Tuning to further transfer l-MLLM capabilities. Our approach significantly improves performance without altering the small model’s architecture. Extensive experiments and ablation studies validate the effectiveness of each proposed component. Code will be available at https://github.com/caiyuxuan1120/LLaVA-KD.
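
The two distillation objectives named in the abstract can be illustrated with a minimal pure-Python sketch. This is not the authors' implementation. The abstract does not give the exact loss forms, so the choices here are assumptions: a per-token KL divergence from teacher to student for the MDist-style term, and a mean squared difference between token-to-token cosine-similarity matrices for the RDist-style term. All function names (`output_distillation_loss`, `relation_distillation_loss`, and the helpers) are hypothetical.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over one list of logits."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions given as lists."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def output_distillation_loss(teacher_logits, student_logits, temperature=1.0):
    """Assumed MDist-style term: mean per-token KL(teacher || student).

    Each argument is a list of per-token logit lists over the same vocabulary.
    """
    losses = [
        kl_divergence(softmax(t, temperature), softmax(s, temperature))
        for t, s in zip(teacher_logits, student_logits)
    ]
    return sum(losses) / len(losses)


def cosine_similarity_matrix(features):
    """N x N matrix of cosine similarities between N feature vectors."""
    norms = [math.sqrt(sum(x * x for x in f)) or 1.0 for f in features]
    unit = [[x / n for x in f] for f, n in zip(features, norms)]
    return [[sum(a * b for a, b in zip(u, v)) for v in unit] for u in unit]


def relation_distillation_loss(teacher_feats, student_feats):
    """Assumed RDist-style term: MSE between the two similarity matrices.

    Teacher and student may use different hidden sizes; only the number of
    visual tokens has to match, because the matrices are N x N.
    """
    rel_t = cosine_similarity_matrix(teacher_feats)
    rel_s = cosine_similarity_matrix(student_feats)
    n = len(rel_t)
    return sum(
        (rel_t[i][j] - rel_s[i][j]) ** 2 for i in range(n) for j in range(n)
    ) / (n * n)
```

Comparing similarity matrices rather than raw features is one way to transfer relational structure between models whose hidden dimensions differ, which is the usual situation when a small student is paired with a large teacher.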

arXiv information

Authors	Yuxuan Cai, Jiangning Zhang, Haoyang He, Xinwei He, Ao Tong, Zhenye Gan, Chengjie Wang, Xiang Bai
Published	2024-10-21 17:41:28+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
