UniWorld: High-Resolution Semantic Encoders for Unified Visual Understanding and Generation

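Abstract

Existing unified models perform well on vision-language understanding and text-to-image generation, but they remain limited in image perception and image manipulation, capabilities that practical applications increasingly demand. OpenAI recently released the powerful GPT-4o-Image model, which demonstrates advanced capabilities in comprehensive image perception and manipulation and has attracted widespread interest. Through carefully designed experiments, we observe that GPT-4o-Image likely relies on semantic encoders rather than VAEs for feature extraction, even though VAEs are commonly regarded as crucial for image manipulation tasks. Inspired by this insight, we propose UniWorld, a unified generative framework built on semantic features extracted from powerful multimodal large language models and contrastive semantic encoders. Trained on only 2.7M samples, UniWorld achieves impressive performance across diverse tasks, including image understanding, generation, manipulation, and perception. The UniWorld framework is fully open-sourced, including model weights, training and evaluation scripts, and datasets, to promote reproducibility and further research.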

Abstract (original)

Although existing unified models achieve strong performance in vision-language understanding and text-to-image generation, they remain limited in addressing image perception and manipulation — capabilities increasingly demanded in practical applications. Recently, OpenAI introduced the powerful GPT-4o-Image model, which showcases advanced capabilities in comprehensive image perception and manipulation, sparking widespread interest. Through carefully designed experiments, we observe that GPT-4o-Image likely relies on semantic encoders rather than VAEs for feature extraction, despite VAEs being commonly regarded as crucial for image manipulation tasks. Inspired by this insight, we propose UniWorld, a unified generative framework built upon semantic features extracted from powerful multimodal large language models and contrastive semantic encoders. Using only 2.7M training data, UniWorld achieves impressive performance across diverse tasks, including image understanding, generation, manipulation, and perception. We fully open-source the UniWorld framework, including model weights, training and evaluation scripts, and datasets to promote reproducibility and further research.

arXiv information

Authors	Bin Lin, Zongjian Li, Xinhua Cheng, Yuwei Niu, Yang Ye, Xianyi He, Shenghai Yuan, Wangbo Yu, Shaodong Wang, Yunyang Ge, Yatian Pang, Li Yuan
Published	2025-06-04 14:45:58+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, DeepL
