SPHINX-X: Scaling Data and Parameters for a Family of Multi-modal Large Language Models

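Summary

We propose SPHINX-X, an extensive series of multimodal large language models (MLLMs) built on SPHINX.
To improve architecture and training efficiency, we modify the SPHINX framework by removing redundant visual encoders, bypassing fully padded sub-images with skip tokens, and simplifying multi-stage training into a one-stage, all-in-one paradigm.
To fully unleash the potential of MLLMs, we assemble a comprehensive multi-domain, multimodal dataset covering publicly available resources for language, vision, and vision-language tasks.
We further enrich this collection with our curated OCR-intensive and Set-of-Mark datasets, extending its diversity and generality.
By training on different base LLMs, including TinyLlama1.1B, InternLM2-7B, LLaMA2-13B, and Mixtral8x7B, we obtain a spectrum of MLLMs that vary in parameter size and multilingual capability.
Comprehensive benchmarking reveals a strong correlation between multimodal performance and the data and parameter scales.
Code and models are released at https://github.com/Alpha-VLLM/LLaMA2-Accessory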

Summary (original)

We propose SPHINX-X, an extensive Multimodality Large Language Model (MLLM) series developed upon SPHINX. To improve the architecture and training efficiency, we modify the SPHINX framework by removing redundant visual encoders, bypassing fully-padded sub-images with skip tokens, and simplifying multi-stage training into a one-stage all-in-one paradigm. To fully unleash the potential of MLLMs, we assemble a comprehensive multi-domain and multimodal dataset covering publicly available resources in language, vision, and vision-language tasks. We further enrich this collection with our curated OCR intensive and Set-of-Mark datasets, extending the diversity and generality. By training over different base LLMs including TinyLlama1.1B, InternLM2-7B, LLaMA2-13B, and Mixtral8x7B, we obtain a spectrum of MLLMs that vary in parameter size and multilingual capabilities. Comprehensive benchmarking reveals a strong correlation between the multi-modal performance with the data and parameter scales. Code and models are released at https://github.com/Alpha-VLLM/LLaMA2-Accessory

arXiv information

Authors	Peng Gao, Renrui Zhang, Chris Liu, Longtian Qiu, Siyuan Huang, Weifeng Lin, Shitian Zhao, Shijie Geng, Ziyi Lin, Peng Jin, Kaipeng Zhang, Wenqi Shao, Chao Xu, Conghui He, Junjun He, Hao Shao, Pan Lu, Hongsheng Li, Yu Qiao
Published	2024-02-08 18:59:48+00:00
arXiv page	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
