Lumina-mGPT: Illuminate Flexible Photorealistic Text-to-Image Generation with Multimodal Generative Pretraining

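Summary

We introduce Lumina-mGPT, a family of multimodal autoregressive models that handle a range of vision and language tasks and are particularly strong at generating flexible, photorealistic images from text descriptions.
By initializing from multimodal Generative PreTraining (mGPT), we show that a decoder-only autoregressive (AR) model can, through Flexible Progressive Supervised Fine-tuning (FP-SFT), reach image generation performance comparable to modern diffusion models while remaining highly efficient.
Equipped with the proposed Unambiguous image Representation (UniRep), Lumina-mGPT can flexibly generate high-quality images at varying aspect ratios.
Building on this strong image generation capability, we further explore Omnipotent Supervised Fine-tuning (Omni-SFT), a first attempt to elevate Lumina-mGPT into a unified multimodal generalist.
The resulting model shows versatile multimodal capabilities: visual generation tasks such as text-to-image/multiview generation and controllable generation, visual recognition tasks such as segmentation and depth estimation, and vision-language tasks such as multi-turn visual question answering, which points to the promise of this technical direction.
Code and checkpoints are available at https://github.com/Alpha-VLLM/Lumina-mGPT.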

Summary (original)

We present Lumina-mGPT, a family of multimodal autoregressive models capable of various vision and language tasks, particularly excelling in generating flexible photorealistic images from text descriptions. By initializing from multimodal Generative PreTraining (mGPT), we demonstrate that a decoder-only Autoregressive (AR) model can achieve image generation performance comparable to modern diffusion models with high efficiency through Flexible Progressive Supervised Fine-tuning (FP-SFT). Equipped with our proposed Unambiguous image Representation (UniRep), Lumina-mGPT can flexibly generate high-quality images of varying aspect ratios. Building on the strong image generation capabilities, we further explore Omnipotent Supervised Fine-tuning (Omni-SFT), an initial attempt to elevate Lumina-mGPT into a unified multi-modal generalist. The resulting model demonstrates versatile multimodal capabilities, including visual generation tasks like text-to-image/multiview generation and controllable generation, visual recognition tasks like segmentation and depth estimation, and vision-language tasks like multi-turn visual question answering, showing the rosy potential of the technical direction. Code and checkpoints are available at https://github.com/Alpha-VLLM/Lumina-mGPT.

arXiv information

Authors	Dongyang Liu, Shitian Zhao, Le Zhuo, Weifeng Lin, Yu Qiao, Hongsheng Li, Peng Gao
Published	2025-04-18 15:32:16+00:00
arXiv page	arxiv_id (pdf)

Sources and services used

arxiv.jp, Google
