UniWorld: High-Resolution Semantic Encoders for Unified Visual Understanding and Generation

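Summary

Existing unified models perform well on vision-language understanding and text-to-image generation, but they have explored image perception and manipulation tasks only to a limited extent, even though users urgently want these capabilities for a wide range of applications. OpenAI recently released GPT-4o-Image, a powerful model for comprehensive image perception and manipulation whose expressive capability has drawn considerable community interest. From the behavior of GPT-4o-Image in carefully constructed experiments, the authors infer that it relies on features extracted by semantic encoders rather than a VAE, even though VAEs are considered essential components in many image manipulation models. Motivated by this observation, they present UniWorld, a unified generative framework built on semantic features provided by powerful vision-language models and contrastive semantic encoders. The resulting model is trained on only 1% of the data used for BAGEL, yet consistently outperforms BAGEL on image editing benchmarks. UniWorld also retains competitive image understanding and generation capabilities and achieves strong performance across multiple image perception tasks. The models are fully open-sourced, including model weights, training and evaluation scripts, and datasets.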

Abstract (original)

Although existing unified models deliver strong performance on vision-language understanding and text-to-image generation, their models are limited in exploring image perception and manipulation tasks, which are urgently desired by users for wide applications. Recently, OpenAI released their powerful GPT-4o-Image model for comprehensive image perception and manipulation, achieving expressive capability and attracting community interests. By observing the performance of GPT-4o-Image in our carefully constructed experiments, we infer that GPT-4o-Image leverages features extracted by semantic encoders instead of VAE, while VAEs are considered essential components in many image manipulation models. Motivated by such inspiring observations, we present a unified generative framework named UniWorld based on semantic features provided by powerful visual-language models and contrastive semantic encoders. As a result, we build a strong unified model using only 1% amount of BAGEL’s data, which consistently outperforms BAGEL on image editing benchmarks. UniWorld also maintains competitive image understanding and generation capabilities, achieving strong performance across multiple image perception tasks. We fully open-source our models, including model weights, training and evaluation scripts, and datasets.

arXiv information

Authors	Bin Lin,Zongjian Li,Xinhua Cheng,Yuwei Niu,Yang Ye,Xianyi He,Shenghai Yuan,Wangbo Yu,Shaodong Wang,Yunyang Ge,Yatian Pang,Li Yuan
Published	2025-06-03 17:59:33+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, DeepL
