GoT: Unleashing Reasoning Capability of Multimodal Large Language Model for Visual Generation and Editing

Abstract

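Current image generation and editing methods mainly process text prompts as direct inputs, without reasoning about visual composition or explicit operations.
We present Generation Chain-of-Thought (GoT), a new paradigm that performs generation and editing through an explicit language reasoning process before outputting images.
This approach turns conventional text-to-image generation and editing into a reasoning-guided framework that analyzes semantic relationships and spatial arrangements.
We define the formulation of GoT and construct large-scale GoT datasets of over 9M samples, each with a detailed reasoning chain capturing semantic-spatial relationships.
To leverage the advantages of GoT, we implement a unified framework that integrates Qwen2.5-VL for reasoning chain generation with an end-to-end diffusion model enhanced by our novel Semantic-Spatial Guidance Module.
Experiments show that the GoT framework achieves excellent performance on both generation and editing tasks, with significant improvements over baselines.
In addition, our approach enables interactive visual generation, letting users explicitly modify reasoning steps to make precise image adjustments.
GoT pioneers a new direction for reasoning-driven visual generation and editing, producing images that better align with human intent.
To facilitate future research, we make our datasets, code, and pretrained models publicly available at https://github.com/rongyaofang/GoT.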

Abstract (original)

Current image generation and editing methods primarily process textual prompts as direct inputs without reasoning about visual composition and explicit operations. We present Generation Chain-of-Thought (GoT), a novel paradigm that enables generation and editing through an explicit language reasoning process before outputting images. This approach transforms conventional text-to-image generation and editing into a reasoning-guided framework that analyzes semantic relationships and spatial arrangements. We define the formulation of GoT and construct large-scale GoT datasets containing over 9M samples with detailed reasoning chains capturing semantic-spatial relationships. To leverage the advantages of GoT, we implement a unified framework that integrates Qwen2.5-VL for reasoning chain generation with an end-to-end diffusion model enhanced by our novel Semantic-Spatial Guidance Module. Experiments show our GoT framework achieves excellent performance on both generation and editing tasks, with significant improvements over baselines. Additionally, our approach enables interactive visual generation, allowing users to explicitly modify reasoning steps for precise image adjustments. GoT pioneers a new direction for reasoning-driven visual generation and editing, producing images that better align with human intent. To facilitate future research, we make our datasets, code, and pretrained models publicly available at https://github.com/rongyaofang/GoT.

arXiv information

Authors	Rongyao Fang, Chengqi Duan, Kun Wang, Linjiang Huang, Hao Li, Shilin Yan, Hao Tian, Xingyu Zeng, Rui Zhao, Jifeng Dai, Xihui Liu, Hongsheng Li
Published	2025-03-13 17:59:59+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
