Image Editing As Programs with Diffusion Models

Summary

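Diffusion models have achieved remarkable success in text-to-image generation, but they still face significant challenges in instruction-driven image editing.
Our study highlights a key challenge: these models struggle in particular with structurally inconsistent edits that involve substantial layout changes.
To close this gap, we introduce Image Editing As Programs (IEAP), a unified image editing framework built on the Diffusion Transformer (DiT) architecture.
At its core, IEAP takes a reductionist approach to instructional editing, decomposing complex editing instructions into sequences of atomic operations.
Each operation is implemented via a lightweight adapter that shares the same DiT backbone and is specialized for a specific type of edit.
Programmed by a vision-language model (VLM)-based agent, these operations work together to support arbitrary and structurally inconsistent transformations.
By modularizing and sequencing edits in this way, IEAP generalizes robustly across a wide range of editing tasks, from simple adjustments to substantial structural changes.
Extensive experiments show that IEAP significantly outperforms state-of-the-art methods on standard benchmarks across a variety of editing scenarios.
In these evaluations, the framework delivers superior accuracy and semantic fidelity, particularly for complex, multi-step instructions.
Code is available at https://github.com/YujiaHu1109/IEAP.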
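
The idea of decomposing an instruction into a sequence of atomic operations can be sketched with a toy program executor. This is a minimal illustration under stated assumptions, not the authors' implementation: the Scene dictionary, the Step class, the operation names (add, remove, recolor), and the hand-written program are all hypothetical. In the framework described above, each operation would be a lightweight adapter on a shared DiT backbone and the program would be produced by a VLM-based agent; here both are replaced by plain Python.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

# Toy stand-in for an image: a mapping from object name to its color.
# A real system would operate on pixels or latents; this is illustration only.
Scene = Dict[str, str]


@dataclass(frozen=True)
class Step:
    """One atomic operation in an edit program (hypothetical structure)."""
    op: str
    args: Tuple[str, ...]


def add_object(scene: Scene, name: str, color: str) -> Scene:
    """Return a new scene that also contains the named object."""
    return {**scene, name: color}


def remove_object(scene: Scene, name: str) -> Scene:
    """Return a new scene without the named object."""
    return {k: v for k, v in scene.items() if k != name}


def recolor(scene: Scene, name: str, color: str) -> Scene:
    """Return a new scene in which an existing object has a new color."""
    if name not in scene:
        raise KeyError(f"no object named {name!r} in scene")
    return {**scene, name: color}


# Registry of atomic operations (names are assumptions for this sketch).
# Each entry stands in for a specialized adapter on a shared backbone.
OPERATIONS: Dict[str, Callable[..., Scene]] = {
    "add": add_object,
    "remove": remove_object,
    "recolor": recolor,
}


def run_program(scene: Scene, program: List[Step]) -> Scene:
    """Apply each atomic step in order, feeding each result into the next."""
    for step in program:
        scene = OPERATIONS[step.op](scene, *step.args)
    return scene


# Hand-written program standing in for what a planning agent might emit for
# "replace the dog with a cat and make the ball red".
program = [
    Step("remove", ("dog",)),
    Step("add", ("cat", "gray")),
    Step("recolor", ("ball", "red")),
]

result = run_program({"dog": "brown", "ball": "blue"}, program)
print(result)
```

The point of the sketch is the structure: a complex edit becomes an ordered list of small, specialized steps, and supporting a new kind of edit means registering one more operation without changing the executor.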

Summary (original)

While diffusion models have achieved remarkable success in text-to-image generation, they encounter significant challenges with instruction-driven image editing. Our research highlights a key challenge: these models particularly struggle with structurally inconsistent edits that involve substantial layout changes. To mitigate this gap, we introduce Image Editing As Programs (IEAP), a unified image editing framework built upon the Diffusion Transformer (DiT) architecture. At its core, IEAP approaches instructional editing through a reductionist lens, decomposing complex editing instructions into sequences of atomic operations. Each operation is implemented via a lightweight adapter sharing the same DiT backbone and is specialized for a specific type of edit. Programmed by a vision-language model (VLM)-based agent, these operations collaboratively support arbitrary and structurally inconsistent transformations. By modularizing and sequencing edits in this way, IEAP generalizes robustly across a wide range of editing tasks, from simple adjustments to substantial structural changes. Extensive experiments demonstrate that IEAP significantly outperforms state-of-the-art methods on standard benchmarks across various editing scenarios. In these evaluations, our framework delivers superior accuracy and semantic fidelity, particularly for complex, multi-step instructions. Codes are available at https://github.com/YujiaHu1109/IEAP.

arXiv information

Authors	Yujia Hu, Songhua Liu, Zhenxiong Tan, Xingyi Yang, Xinchao Wang
Published	2025-06-04 16:57:24+00:00
arXiv page	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google
