Multimodal LLM-Guided Semantic Correction in Text-to-Image Diffusion

Summary

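Diffusion models have become the mainstream architecture for text-to-image generation, achieving remarkable progress in visual quality and prompt controllability.
However, current inference pipelines generally lack interpretable semantic supervision and correction mechanisms throughout the denoising process.
Most existing approaches rely solely on post-hoc scoring of the final image, prompt filtering, or heuristic resampling strategies, which makes them ineffective at providing actionable guidance for correcting the generation trajectory.
As a result, models often suffer from object confusion, spatial errors, inaccurate counts, and missing semantic elements, which severely compromises prompt-image alignment and image quality.
To tackle these challenges, the paper proposes MLLM Semantic-Corrected Ping-Pong-Ahead Diffusion (PPAD), a new framework that, for the first time, introduces a Multimodal Large Language Model (MLLM) as a semantic observer during inference.
PPAD analyzes intermediate generations in real time, identifies latent semantic inconsistencies, and translates the feedback into controllable signals that actively guide the remaining denoising steps.
The framework supports both inference-only and training-enhanced settings, and it performs semantic correction at only a very small number of diffusion steps, offering strong generality and scalability.
Extensive experiments show that PPAD yields significant improvements.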

Abstract (original)

Diffusion models have become the mainstream architecture for text-to-image generation, achieving remarkable progress in visual quality and prompt controllability. However, current inference pipelines generally lack interpretable semantic supervision and correction mechanisms throughout the denoising process. Most existing approaches rely solely on post-hoc scoring of the final image, prompt filtering, or heuristic resampling strategies, making them ineffective in providing actionable guidance for correcting the generative trajectory. As a result, models often suffer from object confusion, spatial errors, inaccurate counts, and missing semantic elements, severely compromising prompt-image alignment and image quality. To tackle these challenges, we propose MLLM Semantic-Corrected Ping-Pong-Ahead Diffusion (PPAD), a novel framework that, for the first time, introduces a Multimodal Large Language Model (MLLM) as a semantic observer during inference. PPAD performs real-time analysis on intermediate generations, identifies latent semantic inconsistencies, and translates feedback into controllable signals that actively guide the remaining denoising steps. The framework supports both inference-only and training-enhanced settings, and performs semantic correction at only extremely few diffusion steps, offering strong generality and scalability. Extensive experiments demonstrate PPAD’s significant improvements.
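
The general idea in the abstract, an observer that inspects intermediate results at a few steps and feeds corrections into the remaining denoising steps, can be illustrated with a toy loop. This is only a minimal sketch under stated assumptions, not the paper's implementation: the abstract does not specify how PPAD encodes feedback or which steps it checks. Every name here (State, denoise_step, run_with_observer, toy_observer) is hypothetical, the "latent" is a single float, and the "observer" is a plain function standing in for an MLLM.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Set, Tuple


@dataclass
class State:
    """Toy stand-in for the sampler state (all fields are hypothetical)."""
    latent: float                      # placeholder for a latent tensor
    prompt: str                        # original text prompt
    hints: List[str] = field(default_factory=list)            # observer feedback so far
    log: List[Tuple[int, str]] = field(default_factory=list)  # (step, conditioning used)


def conditioning(state: State) -> str:
    """Combine the prompt with any feedback collected so far."""
    return "; ".join([state.prompt, *state.hints])


def denoise_step(state: State, t: int, cond: str) -> None:
    """Toy denoising update: shrink the 'noise' and record the conditioning used."""
    state.latent *= 0.9
    state.log.append((t, cond))


def run_with_observer(
    prompt: str,
    num_steps: int,
    check_steps: Set[int],
    observer: Callable[[State, int], Optional[str]],
) -> State:
    """Run the toy loop, consulting the observer only at the chosen steps."""
    state = State(latent=1.0, prompt=prompt)
    for t in range(num_steps):
        denoise_step(state, t, conditioning(state))
        if t in check_steps:
            feedback = observer(state, t)
            if feedback:
                # Feedback only influences the steps that come after this one.
                state.hints.append(feedback)
    return state


def toy_observer(state: State, t: int) -> Optional[str]:
    """Stand-in for an MLLM: always reports the same (assumed) inconsistency."""
    return "add a second cat"


result = run_with_observer("two cats", num_steps=5, check_steps={2}, observer=toy_observer)
```

In this toy run the observer is consulted once, after step 2, so steps 0 to 2 use the original prompt as conditioning and steps 3 and 4 use the prompt plus the feedback. Checking at only a few steps mirrors the abstract's statement that semantic correction happens at very few diffusion steps.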

arXiv information

Authors	Zheqi Lv,Junhao Chen,Qi Tian,Keting Yin,Shengyu Zhang,Fei Wu
Published	2025-05-26 14:42:35+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
