Unified-IO 2: Scaling Autoregressive Multimodal Models with Vision, Language, Audio, and Action

Summary

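We introduce Unified-IO 2, the first autoregressive multimodal model that can understand and generate images, text, audio, and actions.
To unify the different modalities, inputs and outputs such as images, text, audio, actions, and bounding boxes are tokenized into a shared semantic space and processed by a single encoder-decoder transformer model.
Because training on such diverse modalities is difficult, we propose a range of architectural improvements to stabilize model training.
We train the model from scratch on a large multimodal pre-training corpus drawn from diverse sources, using a multimodal mixture-of-denoisers objective.
To learn a broad set of skills, such as following multimodal instructions, we construct an ensemble of 120 datasets with prompts and augmentations and fine-tune on it.
With a single unified model, Unified-IO 2 achieves state-of-the-art performance on the GRIT benchmark and strong results on more than 35 benchmarks, including image generation and understanding, natural language understanding, video and audio understanding, and robotic manipulation.
We release all of our models to the research community.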

Summary (original)

We present Unified-IO 2, the first autoregressive multimodal model that is capable of understanding and generating image, text, audio, and action. To unify different modalities, we tokenize inputs and outputs — images, text, audio, action, bounding boxes, etc., into a shared semantic space and then process them with a single encoder-decoder transformer model. Since training with such diverse modalities is challenging, we propose various architectural improvements to stabilize model training. We train our model from scratch on a large multimodal pre-training corpus from diverse sources with a multimodal mixture of denoisers objective. To learn an expansive set of skills, such as following multimodal instructions, we construct and finetune on an ensemble of 120 datasets with prompts and augmentations. With a single unified model, Unified-IO 2 achieves state-of-the-art performance on the GRIT benchmark and strong results in more than 35 benchmarks, including image generation and understanding, natural language understanding, video and audio understanding, and robotic manipulation. We release all our models to the research community.
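As a rough illustration of what tokenizing bounding boxes into a vocabulary shared with text can look like, the sketch below quantizes box coordinates into discrete location tokens placed after the text-token id range. This is a minimal sketch under stated assumptions, not the authors' implementation: the vocabulary size, the number of location bins, and every function name here are hypothetical and are not taken from the abstract.

```python
from typing import List, Tuple

# All sizes below are illustrative assumptions, not values from the paper.
TEXT_VOCAB_SIZE = 32_000            # hypothetical number of text tokens
NUM_LOCATION_BINS = 1_000           # hypothetical number of coordinate bins
LOCATION_OFFSET = TEXT_VOCAB_SIZE   # location tokens follow the text tokens

Box = Tuple[float, float, float, float]  # (x0, y0, x1, y1) in pixels


def coord_to_token(value: float, extent: float) -> int:
    """Quantize a coordinate in [0, extent] to a location-token id."""
    frac = min(max(value / extent, 0.0), 1.0)  # clamp to [0, 1]
    bin_index = min(int(frac * NUM_LOCATION_BINS), NUM_LOCATION_BINS - 1)
    return LOCATION_OFFSET + bin_index


def token_to_coord(token: int, extent: float) -> float:
    """Map a location-token id back to the centre of its bin."""
    bin_index = token - LOCATION_OFFSET
    if not 0 <= bin_index < NUM_LOCATION_BINS:
        raise ValueError(f"token {token} is not a location token")
    return (bin_index + 0.5) / NUM_LOCATION_BINS * extent


def encode_box(box: Box, width: float, height: float) -> List[int]:
    """Turn a pixel-space box into four ids in the shared vocabulary."""
    x0, y0, x1, y1 = box
    return [
        coord_to_token(x0, width),
        coord_to_token(y0, height),
        coord_to_token(x1, width),
        coord_to_token(y1, height),
    ]


def decode_box(tokens: List[int], width: float, height: float) -> Box:
    """Invert encode_box up to quantization error."""
    x0, y0, x1, y1 = tokens
    return (
        token_to_coord(x0, width),
        token_to_coord(y0, height),
        token_to_coord(x1, width),
        token_to_coord(y1, height),
    )
```

Because the location ids sit in the same integer space as the text ids, a sequence model can read or emit a box as four ordinary tokens. The round trip through `encode_box` and `decode_box` is lossy by at most half a bin width per coordinate.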

arXiv information

Authors	Jiasen Lu, Christopher Clark, Sangho Lee, Zichen Zhang, Savya Khosla, Ryan Marten, Derek Hoiem, Aniruddha Kembhavi
Published	2023-12-28 17:57:06+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
