An Empirical Study of Scaling Instruct-Tuned Large Multimodal Models

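Summary

Visual instruction tuning has recently shown promising progress in open-source large multimodal models (LMMs) such as LLaVA and MiniGPT-4.
However, most existing studies of open-source LMMs use models with 13B parameters or fewer.
In this paper, we present an empirical study of scaling LLaVA to 33B and 65B/70B parameters, and share findings from our explorations of image resolution, data mixing, and parameter-efficient training methods such as LoRA/QLoRA.
These choices are evaluated by their effect on multimodal and language capabilities when completing real-world tasks in the wild.
We find that scaling LMMs consistently improves model performance and language capabilities, and that LoRA/QLoRA tuning of LMMs performs comparably to full-model fine-tuning.
The study also highlights the importance of higher image resolution and of mixing multimodal and language data for improving LMM performance, and shows that visual instruction tuning can sometimes improve an LMM's pure language capability.
We hope this study makes state-of-the-art LMM research at larger scale more accessible and helps establish stronger baselines for future research.
Code and checkpoints will be made public.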

Abstract (original)

Visual instruction tuning has recently shown encouraging progress with open-source large multimodal models (LMM) such as LLaVA and MiniGPT-4. However, most existing studies of open-source LMM are performed using models with 13B parameters or smaller. In this paper we present an empirical study of scaling LLaVA up to 33B and 65B/70B, and share our findings from our explorations in image resolution, data mixing and parameter-efficient training methods such as LoRA/QLoRA. These are evaluated by their impact on the multi-modal and language capabilities when completing real-world tasks in the wild. We find that scaling LMM consistently enhances model performance and improves language capabilities, and performance of LoRA/QLoRA tuning of LMM are comparable to the performance of full-model fine-tuning. Additionally, the study highlights the importance of higher image resolutions and mixing multimodal-language data to improve LMM performance, and visual instruction tuning can sometimes improve LMM’s pure language capability. We hope that this study makes state-of-the-art LMM research at a larger scale more accessible, thus helping establish stronger baselines for future research. Code and checkpoints will be made public.
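
For readers unfamiliar with LoRA, the parameter-efficient method the abstract mentions, the following is a minimal sketch of the general idea and not the authors' training code. A frozen weight matrix W is augmented with a trainable low-rank product B·A, so only rank × (d_in + d_out) values are trained instead of d_in × d_out. The class name `LoRALinear`, the layer sizes, the rank, and the scaling factor `alpha` are illustrative assumptions, not values from the paper.

```python
import random


def matvec(m, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in m]


class LoRALinear:
    """Frozen linear layer plus a trainable low-rank update (illustrative only)."""

    def __init__(self, d_in, d_out, rank, alpha=1.0, seed=0):
        rng = random.Random(seed)
        # Frozen "pretrained" weight, shape (d_out, d_in); random stand-in values.
        self.W = [[rng.gauss(0, 0.02) for _ in range(d_in)] for _ in range(d_out)]
        # Trainable factors: A is (rank, d_in), B is (d_out, rank).
        self.A = [[rng.gauss(0, 0.02) for _ in range(d_in)] for _ in range(rank)]
        # B starts at zero, so the adapted layer initially equals the frozen one.
        self.B = [[0.0] * rank for _ in range(d_out)]
        self.scale = alpha / rank

    def forward(self, x):
        base = matvec(self.W, x)                    # frozen path: W x
        update = matvec(self.B, matvec(self.A, x))  # low-rank path: B (A x)
        return [b + self.scale * u for b, u in zip(base, update)]

    def trainable_params(self):
        return len(self.A) * len(self.A[0]) + len(self.B) * len(self.B[0])

    def full_params(self):
        return len(self.W) * len(self.W[0])


layer = LoRALinear(d_in=64, d_out=64, rank=4)
x = [1.0] * 64
y = layer.forward(x)
print(layer.trainable_params(), layer.full_params())  # prints "512 4096"
```

In this toy configuration the adapter trains 512 values against 4096 for the full matrix. QLoRA follows the same pattern but additionally stores the frozen weights in a quantized form, which this sketch does not model.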

arXiv information

Authors	Yadong Lu, Chunyuan Li, Haotian Liu, Jianwei Yang, Jianfeng Gao, Yelong Shen
Published	2023-09-18 17:30:46+00:00
arXiv site	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google
