xGen-MM (BLIP-3): A Family of Open Large Multimodal Models

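Summary

This report introduces xGen-MM (also known as BLIP-3), a framework for developing Large Multimodal Models (LMMs).
The framework comprises meticulously curated datasets, a training recipe, model architectures, and the resulting suite of LMMs.
xGen-MM (short for xGen-MultiModal) extends the Salesforce xGen initiative on foundation AI models.
Our models undergo rigorous evaluation across a range of tasks, including both single-image and multi-image benchmarks.
Our pre-trained base model exhibits strong in-context learning capabilities, and the instruction-tuned model demonstrates competitive performance among open-source LMMs of similar size.
In addition, we introduce a safety-tuned model trained with DPO, which aims to mitigate harmful behaviors such as hallucinations and to improve safety.
We open-source our models, curated large-scale datasets, and our fine-tuning codebase to facilitate further advances in LMM research.
Associated resources will be available on the project page above.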

Summary (original)

This report introduces xGen-MM (also known as BLIP-3), a framework for developing Large Multimodal Models (LMMs). The framework comprises meticulously curated datasets, a training recipe, model architectures, and a resulting suite of LMMs. xGen-MM, short for xGen-MultiModal, expands the Salesforce xGen initiative on foundation AI models. Our models undergo rigorous evaluation across a range of tasks, including both single and multi-image benchmarks. Our pre-trained base model exhibits strong in-context learning capabilities and the instruction-tuned model demonstrates competitive performance among open-source LMMs with similar model sizes. In addition, we introduce a safety-tuned model with DPO, aiming to mitigate harmful behaviors such as hallucinations and improve safety. We open-source our models, curated large-scale datasets, and our fine-tuning codebase to facilitate further advancements in LMM research. Associated resources will be available on our project page above.

arXiv information

Authors	Le Xue, Manli Shu, Anas Awadalla, Jun Wang, An Yan, Senthil Purushwalkam, Honglu Zhou, Viraj Prabhu, Yutong Dai, Michael S Ryoo, Shrikant Kendre, Jieyu Zhang, Can Qin, Shu Zhang, Chia-Chih Chen, Ning Yu, Juntao Tan, Tulika Manoj Awalgaonkar, Shelby Heinecke, Huan Wang, Yejin Choi, Ludwig Schmidt, Zeyuan Chen, Silvio Savarese, Juan Carlos Niebles, Caiming Xiong, Ran Xu
Published	2024-08-16 17:57:01+00:00
arXiv page	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google
