Emu3: Next-Token Prediction is All You Need

Summary

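Next-token prediction is regarded as a promising path toward artificial general intelligence, but it has struggled to excel at multimodal tasks, which are still dominated by diffusion models (e.g., Stable Diffusion) and compositional approaches (e.g., CLIP combined with LLMs).
This paper introduces Emu3, a new suite of state-of-the-art multimodal models trained solely with next-token prediction.
By tokenizing images, text, and videos into a discrete space, we train a single transformer from scratch on a mixture of multimodal sequences.
Emu3 outperforms several well-established task-specific models on both generation and perception tasks, surpassing flagship models such as SDXL and LLaVA-1.6, while eliminating the need for diffusion or compositional architectures.
Emu3 can also generate high-fidelity video by predicting the next token in a video sequence.
By converging on a single focus, tokens, we simplify complex multimodal model design and unlock great potential for scaling during both training and inference.
Our results demonstrate that next-token prediction is a promising path toward building general multimodal intelligence beyond language.
We open-source key techniques and models to support further research in this direction.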
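
To make the core idea concrete, here is a minimal sketch of how text and visual content can share one discrete token space and be trained with a single next-token objective. It is an illustration under stated assumptions, not the Emu3 implementation: the vocabulary sizes, the `BOI`/`EOI` boundary tokens, the offset scheme, and the `uniform_model` stand-in are all hypothetical and do not come from the paper.

```python
import math

# Hypothetical sizes and special tokens, chosen only for illustration.
TEXT_VOCAB = 1000                      # ids 0..999 are text tokens
VISION_CODEBOOK = 512                  # ids 1000..1511 are visual codebook entries
BOI = TEXT_VOCAB + VISION_CODEBOOK     # hypothetical begin-of-image marker
EOI = BOI + 1                          # hypothetical end-of-image marker
VOCAB_SIZE = EOI + 1


def build_sequence(text_ids, image_codes):
    """Flatten text ids and visual codes into one discrete token sequence."""
    vision_ids = [TEXT_VOCAB + c for c in image_codes]  # shift into the shared id space
    return list(text_ids) + [BOI] + vision_ids + [EOI]


def next_token_pairs(sequence):
    """Return (context, target) pairs: each token is predicted from its prefix."""
    return [(sequence[:i], sequence[i]) for i in range(1, len(sequence))]


def cross_entropy(logits, target):
    """Negative log-likelihood of `target` under softmax(logits)."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[target]


def uniform_model(context):
    """Stand-in for a transformer: returns equal logits for every token."""
    return [0.0] * VOCAB_SIZE


def sequence_loss(model, sequence):
    """Mean next-token loss; the same objective applies to every modality."""
    pairs = next_token_pairs(sequence)
    return sum(cross_entropy(model(ctx), tgt) for ctx, tgt in pairs) / len(pairs)


seq = build_sequence(text_ids=[5, 42, 7], image_codes=[0, 511, 3, 3])
loss = sequence_loss(uniform_model, seq)
```

Once every modality is mapped to integer ids, the loss function no longer needs to know which tokens came from text and which from an image, which is the simplification the summary describes. A real system would replace `uniform_model` with a trained transformer and obtain `image_codes` from a learned visual tokenizer.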

Abstract (original)

While next-token prediction is considered a promising path towards artificial general intelligence, it has struggled to excel in multimodal tasks, which are still dominated by diffusion models (e.g., Stable Diffusion) and compositional approaches (e.g., CLIP combined with LLMs). In this paper, we introduce Emu3, a new suite of state-of-the-art multimodal models trained solely with next-token prediction. By tokenizing images, text, and videos into a discrete space, we train a single transformer from scratch on a mixture of multimodal sequences. Emu3 outperforms several well-established task-specific models in both generation and perception tasks, surpassing flagship models such as SDXL and LLaVA-1.6, while eliminating the need for diffusion or compositional architectures. Emu3 is also capable of generating high-fidelity video via predicting the next token in a video sequence. We simplify complex multimodal model designs by converging on a singular focus: tokens, unlocking great potential for scaling both during training and inference. Our results demonstrate that next-token prediction is a promising path towards building general multimodal intelligence beyond language. We open-source key techniques and models to support further research in this direction.

arXiv information

Authors	Xinlong Wang, Xiaosong Zhang, Zhengxiong Luo, Quan Sun, Yufeng Cui, Jinsheng Wang, Fan Zhang, Yueze Wang, Zhen Li, Qiying Yu, Yingli Zhao, Yulong Ao, Xuebin Min, Tao Li, Boya Wu, Bo Zhao, Bowen Zhang, Liangdong Wang, Guang Liu, Zheqi He, Xi Yang, Jingjing Liu, Yonghua Lin, Tiejun Huang, Zhongyuan Wang
Published	2024-09-27 16:06:11+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
