Generative Multimodal Models are In-Context Learners

Summary

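Humans can easily solve multimodal tasks in context, that is, from only a few demonstrations or simple instructions, an ability that current multimodal systems have largely struggled to imitate.
This work demonstrates that the task-agnostic in-context learning capabilities of large multimodal models can be significantly enhanced by effective scaling-up.
Emu2 is a generative multimodal model with 37 billion parameters, trained on large-scale multimodal sequences with a unified autoregressive objective.
Emu2 exhibits strong multimodal in-context learning abilities, and even becomes able to solve tasks that require on-the-fly reasoning, such as visual prompting and object-grounded generation.
The model sets new records on multiple multimodal understanding tasks in few-shot settings.
When instruction-tuned to follow specific instructions, Emu2 further achieves new state-of-the-art results on challenging tasks such as question answering benchmarks for large multimodal models and open-ended subject-driven generation.
These results show that Emu2 can serve as a base model and general-purpose interface for a wide range of multimodal tasks.
Code and models are publicly available to facilitate future research.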

Abstract (original)

The human ability to easily solve multimodal tasks in context (i.e., with only a few demonstrations or simple instructions), is what current multimodal systems have largely struggled to imitate. In this work, we demonstrate that the task-agnostic in-context learning capabilities of large multimodal models can be significantly enhanced by effective scaling-up. We introduce Emu2, a generative multimodal model with 37 billion parameters, trained on large-scale multimodal sequences with a unified autoregressive objective. Emu2 exhibits strong multimodal in-context learning abilities, even emerging to solve tasks that require on-the-fly reasoning, such as visual prompting and object-grounded generation. The model sets a new record on multiple multimodal understanding tasks in few-shot settings. When instruction-tuned to follow specific instructions, Emu2 further achieves new state-of-the-art on challenging tasks such as question answering benchmarks for large multimodal models and open-ended subject-driven generation. These achievements demonstrate that Emu2 can serve as a base model and general-purpose interface for a wide range of multimodal tasks. Code and models are publicly available to facilitate future research.

arXiv information

Authors	Quan Sun, Yufeng Cui, Xiaosong Zhang, Fan Zhang, Qiying Yu, Zhengxiong Luo, Yueze Wang, Yongming Rao, Jingjing Liu, Tiejun Huang, Xinlong Wang
Published	2023-12-20 18:59:58+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
