Scaling Autoregressive Multi-Modal Models: Pretraining and Instruction Tuning

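Summary

CM3Leon (pronounced "Chameleon") is a retrieval-augmented, token-based, decoder-only multi-modal language model that can generate and infill both text and images.
CM3Leon uses the CM3 multi-modal architecture, and additionally shows the large benefits of scaling up and of tuning on more diverse instruction-style data.
It is the first multi-modal model trained with a recipe adapted from text-only language models: a large-scale retrieval-augmented pre-training stage followed by a second, multi-task supervised fine-tuning (SFT) stage.
Because it is a general-purpose model that handles both text-to-image and image-to-text generation, it also makes it possible to introduce self-contained contrastive decoding methods that produce high-quality outputs.
Extensive experiments show that this recipe is highly effective for multi-modal models.
CM3Leon achieves state-of-the-art text-to-image generation (zero-shot MS-COCO FID of 4.88) with one-fifth of the training compute of comparable methods.
After SFT, CM3Leon also shows unprecedented levels of controllability in tasks ranging from language-guided image editing to image-controlled generation and segmentation.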

Summary (original)

We present CM3Leon (pronounced ‘Chameleon’), a retrieval-augmented, token-based, decoder-only multi-modal language model capable of generating and infilling both text and images. CM3Leon uses the CM3 multi-modal architecture but additionally shows the extreme benefits of scaling up and tuning on more diverse instruction-style data. It is the first multi-modal model trained with a recipe adapted from text-only language models, including a large-scale retrieval-augmented pre-training stage and a second multi-task supervised fine-tuning (SFT) stage. It is also a general-purpose model that can do both text-to-image and image-to-text generation, allowing us to introduce self-contained contrastive decoding methods that produce high-quality outputs. Extensive experiments demonstrate that this recipe is highly effective for multi-modal models. CM3Leon achieves state-of-the-art performance in text-to-image generation with 5x less training compute than comparable methods (zero-shot MS-COCO FID of 4.88). After SFT, CM3Leon can also demonstrate unprecedented levels of controllability in tasks ranging from language-guided image editing to image-controlled generation and segmentation.
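The abstract names contrastive decoding but does not describe the procedure. As a minimal sketch of the general idea only, and not the authors' implementation, the following assumes the generic formulation in which a token is chosen by the gap between a "strong" and a "weak" next-token distribution, restricted to tokens the strong distribution finds plausible. How CM3Leon obtains the two distributions from a single model is not stated here; the names `contrastive_step`, `strong_logits`, `weak_logits`, and the threshold `alpha` are hypothetical.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def contrastive_step(strong_logits, weak_logits, alpha=0.1):
    """Choose one token index with a generic contrastive-decoding rule.

    Assumption: strong_logits and weak_logits are next-token logits over
    the same vocabulary from a stronger and a weaker predictor.
    """
    strong = log_softmax(strong_logits)
    weak = log_softmax(weak_logits)
    # Plausibility filter: keep tokens whose strong probability is at
    # least alpha times the strong distribution's maximum probability.
    cutoff = max(strong) + math.log(alpha)
    candidates = [i for i, lp in enumerate(strong) if lp >= cutoff]
    # Among plausible tokens, maximize the log-probability gap.
    return max(candidates, key=lambda i: strong[i] - weak[i])


# Toy example: tokens 0 and 1 are nearly tied under the strong predictor,
# but the weak predictor already favors token 0, so token 1 wins.
choice = contrastive_step([2.0, 1.9, -5.0], [2.0, 0.0, 3.0])
```

With `alpha=1.0` the filter keeps only the strong predictor's top token, so the rule reduces to greedy decoding; smaller values let the contrast term have more influence.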

arXiv information

Authors	Lili Yu,Bowen Shi,Ramakanth Pasunuru,Benjamin Muller,Olga Golovneva,Tianlu Wang,Arun Babu,Binh Tang,Brian Karrer,Shelly Sheynin,Candace Ross,Adam Polyak,Russell Howes,Vasu Sharma,Puxin Xu,Hovhannes Tamoyan,Oron Ashual,Uriel Singer,Shang-Wen Li,Susan Zhang,Richard James,Gargi Ghosh,Yaniv Taigman,Maryam Fazel-Zarandi,Asli Celikyilmaz,Luke Zettlemoyer,Armen Aghajanyan
Published	2023-09-05 21:27:27+00:00
arXiv page	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google
