VILA: On Pre-training for Visual Language Models

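Summary

Visual language models (VLMs) have progressed rapidly with the recent success of large language models (LLMs).
Efforts on visual instruction tuning to extend LLMs with visual inputs are growing, but an in-depth study of the visual language pre-training process, in which the model learns joint modeling of both modalities, is still lacking.
In this work, we examine the design options for VLM pre-training by augmenting an LLM towards a VLM through step-by-step controllable comparisons.
We introduce three main findings: (1) Freezing the LLM during pre-training can achieve decent zero-shot performance but lacks in-context learning capability, which requires unfreezing the LLM.
(2) Interleaved pre-training data is beneficial, whereas image-text pairs alone are not optimal.
(3) Re-blending text-only instruction data into image-text data during instruction fine-tuning not only remedies the degradation on text-only tasks but also boosts VLM task accuracy.
With this enhanced pre-training recipe, we build VILA, a visual language model family that consistently outperforms state-of-the-art models such as LLaVA-1.5 across the main benchmarks without bells and whistles.
Multi-modal pre-training also helps unveil appealing properties of VILA, including multi-image reasoning, enhanced in-context learning, and better world knowledge.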
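
Finding (3) describes mixing text-only instruction samples back into the image-text instruction data. The sketch below is a minimal illustration of such a blend, not the authors' implementation: the function name `blend_instruction_data`, the `text_only_fraction` parameter, and its default of 0.2 are all assumptions made for this example, since the abstract does not state a mixing ratio.

```python
import random


def blend_instruction_data(image_text, text_only, text_only_fraction=0.2, seed=0):
    """Return a shuffled mix of image-text and text-only samples.

    text_only_fraction is a hypothetical knob: the target share of
    text-only samples in the final mix. The abstract gives no ratio.
    """
    if not 0.0 <= text_only_fraction < 1.0:
        raise ValueError("text_only_fraction must be in [0, 1)")
    rng = random.Random(seed)
    # Number of text-only samples needed to reach the requested share,
    # capped by how many text-only samples are available.
    n_text = round(len(image_text) * text_only_fraction / (1.0 - text_only_fraction))
    n_text = min(n_text, len(text_only))
    mixed = list(image_text) + rng.sample(list(text_only), n_text)
    rng.shuffle(mixed)
    return mixed


if __name__ == "__main__":
    # Toy placeholder samples; real data would hold images and conversations.
    image_text = [{"kind": "image_text", "id": i} for i in range(80)]
    text_only = [{"kind": "text_only", "id": i} for i in range(100)]
    mixed = blend_instruction_data(image_text, text_only)
    share = sum(s["kind"] == "text_only" for s in mixed) / len(mixed)
    print(len(mixed), share)
```

All image-text samples are kept and only the text-only set is subsampled, which is one simple way to hold the mix at a fixed share. Other schemes, such as per-batch sampling, would serve the same purpose.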

Summary (original)

Visual language models (VLMs) rapidly progressed with the recent success of large language models. There have been growing efforts on visual instruction tuning to extend the LLM with visual inputs, but lacks an in-depth study of the visual language pre-training process, where the model learns to perform joint modeling on both modalities. In this work, we examine the design options for VLM pre-training by augmenting LLM towards VLM through step-by-step controllable comparisons. We introduce three main findings: (1) freezing LLMs during pre-training can achieve decent zero-shot performance, but lack in-context learning capability, which requires unfreezing the LLM; (2) interleaved pre-training data is beneficial whereas image-text pairs alone are not optimal; (3) re-blending text-only instruction data to image-text data during instruction fine-tuning not only remedies the degradation of text-only tasks, but also boosts VLM task accuracy. With an enhanced pre-training recipe we build VILA, a Visual Language model family that consistently outperforms the state-of-the-art models, e.g., LLaVA-1.5, across main benchmarks without bells and whistles. Multi-modal pre-training also helps unveil appealing properties of VILA, including multi-image reasoning, enhanced in-context learning, and better world knowledge.

arXiv information

Authors	Ji Lin, Hongxu Yin, Wei Ping, Yao Lu, Pavlo Molchanov, Andrew Tao, Huizi Mao, Jan Kautz, Mohammad Shoeybi, Song Han
Published	2023-12-12 18:58:18+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
