What Matters in Training a GPT4-Style Language Model with Multimodal Inputs?

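Summary

Recent advances in large language models (LLMs) such as GPT4 have shown exceptional multimodal capabilities in following open-ended instructions about images.
However, the performance of these models depends heavily on design choices such as network structure, training data, and training strategy, and because these choices are not discussed extensively in the literature, progress in the field is difficult to quantify.
To address this, the paper presents a systematic and comprehensive study, both quantitative and qualitative, of how to train such models.
The authors implement more than 20 variants under controlled settings.
For network structure, they compare different LLM backbones and model designs.
For training data, they investigate the impact of the data and of sampling strategies.
For instructions, they examine how diversified prompts affect the instruction-following ability of the trained models.
For benchmarks, they contribute what is, to the best of their knowledge, the first comprehensive evaluation set covering both image and video tasks, collected through crowd-sourcing.
Based on these findings, they present Lynx, which achieves the most accurate multimodal understanding while retaining the best multimodal generation ability among existing open-source GPT4-style models.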

Abstract (original)

Recent advancements in Large Language Models (LLMs) such as GPT4 have displayed exceptional multi-modal capabilities in following open-ended instructions given images. However, the performance of these models heavily relies on design choices such as network structures, training data, and training strategies, and these choices have not been extensively discussed in the literature, making it difficult to quantify progress in this field. To address this issue, this paper presents a systematic and comprehensive study, quantitatively and qualitatively, on training such models. We implement over 20 variants with controlled settings. Concretely, for network structures, we compare different LLM backbones and model designs. For training data, we investigate the impact of data and sampling strategies. For instructions, we explore the influence of diversified prompts on the instruction-following ability of the trained models. For benchmarks, we contribute the first, to our best knowledge, comprehensive evaluation set including both image and video tasks through crowd-sourcing. Based on our findings, we present Lynx, which performs the most accurate multi-modal understanding while keeping the best multi-modal generation ability compared to existing open-sourced GPT4-style models.

arXiv information

Authors	Yan Zeng, Hanbo Zhang, Jiani Zheng, Jiangnan Xia, Guoqiang Wei, Yang Wei, Yuchen Zhang, Tao Kong
Published	2023-07-05 17:44:28+00:00
arXiv page	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google
