RealGeneral: Unifying Visual Generation via Temporal In-Context Learning with Video Models

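Summary

Unifying diverse image generation tasks within a single framework remains a fundamental challenge in visual generation.
Large language models (LLMs) achieve unification through task-agnostic data and generation, but existing visual generation models do not meet these principles.
Current approaches either rely on per-task datasets and large-scale training, or adapt pre-trained image models with task-specific modifications, which limits their generalizability.
In this work, we explore video models as a foundation for unified image generation, leveraging their inherent ability to model temporal correlations.
We introduce RealGeneral, a novel framework that reformulates image generation as a conditional frame prediction task, analogous to in-context learning in LLMs.
To bridge the gap between video models and condition-image pairs, we propose (1) a Unified Conditional Embedding module for multi-modal alignment and (2) a Unified Stream DiT Block with decoupled adaptive LayerNorm and an attention mask to mitigate cross-modal interference.
RealGeneral demonstrates effectiveness on multiple important visual generation tasks: for example, it achieves a 14.5% improvement in subject similarity for customized generation and a 10% improvement in image quality for the canny-to-image task.
Project page: https://lyne1.github.io/RealGeneral/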

Abstract (original)

Unifying diverse image generation tasks within a single framework remains a fundamental challenge in visual generation. While large language models (LLMs) achieve unification through task-agnostic data and generation, existing visual generation models fail to meet these principles. Current approaches either rely on per-task datasets and large-scale training or adapt pre-trained image models with task-specific modifications, limiting their generalizability. In this work, we explore video models as a foundation for unified image generation, leveraging their inherent ability to model temporal correlations. We introduce RealGeneral, a novel framework that reformulates image generation as a conditional frame prediction task, analogous to in-context learning in LLMs. To bridge the gap between video models and condition-image pairs, we propose (1) a Unified Conditional Embedding module for multi-modal alignment and (2) a Unified Stream DiT Block with decoupled adaptive LayerNorm and attention mask to mitigate cross-modal interference. RealGeneral demonstrates effectiveness in multiple important visual generation tasks, e.g., it achieves a 14.5% improvement in subject similarity for customized generation and a 10% enhancement in image quality for canny-to-image task. Project page: https://lyne1.github.io/RealGeneral/
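
The abstract gives no implementation details, so the following is only a minimal sketch of the general idea of treating a condition-image pair as a short video. It assumes the condition image is placed as a clean first frame and the target as a noised second frame that alone is supervised. The function names, the linear noise mixing, and the masking rule (text and condition tokens do not attend to each other) are hypothetical illustrations, not the authors' method.

```python
import random


def make_two_frame_clip(condition, target, noise_level, rng):
    """Stack a condition image and a noised target as a 2-frame 'video'.

    Images are flat lists of floats for simplicity. Frame 0 is the clean
    condition; frame 1 is the target mixed with Gaussian noise (assumed
    linear mixing, chosen only for illustration).
    """
    if len(condition) != len(target):
        raise ValueError("frames of one clip must have the same size")
    noised = [(1.0 - noise_level) * t + noise_level * rng.gauss(0.0, 1.0)
              for t in target]
    clip = [list(condition), noised]
    # Only the second frame is predicted, so only it contributes to the loss.
    loss_mask = [[0] * len(condition), [1] * len(target)]
    return clip, loss_mask


def build_attention_mask(n_text, n_cond, n_target):
    """Return mask[i][j] == True if token i may attend to token j.

    Token order: [text | condition frame | target frame].
    Hypothetical rule: text and condition tokens are hidden from each
    other, while target tokens attend to every token.
    """
    groups = ["text"] * n_text + ["cond"] * n_cond + ["target"] * n_target
    blocked = {("text", "cond"), ("cond", "text")}
    return [[(gi, gj) not in blocked for gj in groups] for gi in groups]


if __name__ == "__main__":
    rng = random.Random(0)
    clip, loss_mask = make_two_frame_clip([0.1, 0.2], [0.5, 0.6], 0.3, rng)
    mask = build_attention_mask(n_text=1, n_cond=2, n_target=2)
    print(len(clip), loss_mask)
    for row in mask:
        print(["x" if allowed else "." for allowed in row])
```

In this sketch the frame axis plays the role that the prompt plays in LLM in-context learning: the first frame supplies the context and the model is asked to predict the next one.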

arXiv information

Authors	Yijing Lin,Mengqi Huang,Shuhan Zhuang,Zhendong Mao
Published	2025-03-13 14:31:52+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
