Multimodal Representation Alignment for Image Generation: Text-Image Interleaved Control Is Easier Than You Think

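Summary

The field of advanced text-to-image generation is seeing the emergence of unified frameworks that integrate powerful text encoders, such as CLIP and T5, with Diffusion Transformer backbones.
There have been efforts to control output images with additional conditions such as Canny edges and depth maps, but a comprehensive framework for arbitrary text-image interleaved control is still lacking.
This gap is especially evident when attempting to merge concepts or visual elements from multiple images during generation.
To narrow the gap, the authors ran preliminary experiments showing that large multimodal models (LMMs) provide an effective shared representation space, in which image and text can be well aligned to serve as a condition for an external diffusion model.
Based on this finding, they propose Dream Engine, an efficient and unified framework designed for arbitrary text-image interleaved control in image generation models.
Building on powerful text-to-image models such as SD3.5, they replace the original text-only encoders with versatile multimodal information encoders such as QwenVL.
The approach uses a two-stage training paradigm consisting of joint text-image alignment followed by multimodal interleaved instruction tuning.
Experiments show that this training method is effective: it achieves an overall score of 0.69 on the GenEval benchmark, matching the performance of state-of-the-art text-to-image models such as SD3.5 and FLUX.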

Summary (original)

The field of advanced text-to-image generation is witnessing the emergence of unified frameworks that integrate powerful text encoders, such as CLIP and T5, with Diffusion Transformer backbones. Although there have been efforts to control output images with additional conditions, like canny and depth map, a comprehensive framework for arbitrary text-image interleaved control is still lacking. This gap is especially evident when attempting to merge concepts or visual elements from multiple images in the generation process. To mitigate the gap, we conducted preliminary experiments showing that large multimodal models (LMMs) offer an effective shared representation space, where image and text can be well-aligned to serve as a condition for external diffusion models. Based on this discovery, we propose Dream Engine, an efficient and unified framework designed for arbitrary text-image interleaved control in image generation models. Building on powerful text-to-image models like SD3.5, we replace the original text-only encoders by incorporating versatile multimodal information encoders such as QwenVL. Our approach utilizes a two-stage training paradigm, consisting of joint text-image alignment and multimodal interleaved instruction tuning. Our experiments demonstrate that this training method is effective, achieving a 0.69 overall score on the GenEval benchmark, and matching the performance of state-of-the-art text-to-image models like SD3.5 and FLUX.

arXiv information

Authors	Liang Chen, Shuai Bai, Wenhao Chai, Weichu Xie, Haozhe Zhao, Leon Vinci, Junyang Lin, Baobao Chang
Published	2025-02-27 15:08:39+00:00
arXiv site	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google
