In-Context Edit: Enabling Instructional Image Editing with In-Context Generation in Large Scale Diffusion Transformer

Summary

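Instruction-based image editing enables robust image modification through natural-language prompts, but current methods face a trade-off between precision and efficiency.
Fine-tuning methods require substantial computational resources and large datasets, while training-free techniques struggle with instruction comprehension and edit quality.
The authors resolve this dilemma by leveraging the enhanced generative capacity and native contextual awareness of large-scale Diffusion Transformers (DiT).
The solution makes three contributions: (1) an in-context editing framework that achieves zero-shot instruction compliance through in-context prompting, without structural changes to the model;
(2) a LoRA-MoE hybrid tuning strategy that increases flexibility through efficient adaptation and dynamic expert routing, without extensive retraining; and
(3) an early-filter inference-time scaling method that uses vision-language models (VLMs) to select better initial noise at an early stage, improving edit quality.
Extensive evaluations demonstrate the method's superiority: it outperforms state-of-the-art approaches while requiring only 0.5% of the training data and 1% of the trainable parameters of conventional baselines.
This work establishes a new paradigm for high-precision yet efficient instruction-guided editing.
Code and demos are available at https://river-zhang.github.io/ICEdit-gh-pages/.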
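
The early-filter idea in contribution (3) can be illustrated with a minimal sketch. The summary only says that a VLM is used to pick better initial noise early; everything else here is an assumption made for illustration: the function names (`early_filter_select`, `toy_partial_denoise`, `toy_score`), the use of a scalar score per candidate, and the number of preview steps are hypothetical and are not taken from the authors' code.

```python
import random
from typing import Callable, List, Tuple


def early_filter_select(
    seeds: List[int],
    partial_denoise: Callable[[int, int], object],
    score_preview: Callable[[object], float],
    preview_steps: int = 4,
) -> Tuple[int, float]:
    """Return the seed whose few-step preview gets the highest score.

    partial_denoise(seed, steps) is assumed to run only a few sampling
    steps from the noise defined by `seed` and return a rough preview.
    score_preview(preview) is assumed to wrap a VLM judgment of how well
    the preview follows the edit instruction (higher is better).
    """
    if not seeds:
        raise ValueError("at least one candidate seed is required")
    best_seed, best_score = seeds[0], float("-inf")
    for seed in seeds:
        preview = partial_denoise(seed, preview_steps)  # cheap, few steps
        score = score_preview(preview)                  # VLM stand-in
        if score > best_score:
            best_seed, best_score = seed, score
    return best_seed, best_score


# Toy stand-ins so the sketch runs without a diffusion model or a VLM.
def toy_partial_denoise(seed: int, steps: int) -> List[float]:
    rng = random.Random(seed)
    return [rng.random() for _ in range(steps)]


def toy_score(preview: List[float]) -> float:
    return sum(preview) / len(preview)


candidate_seeds = [0, 1, 2, 3]
best_seed, best_score = early_filter_select(
    candidate_seeds, toy_partial_denoise, toy_score
)
```

Only the selected seed would then be carried through the full sampling schedule, so the cost of the discarded candidates is limited to the short preview runs.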

Summary (original)

Instruction-based image editing enables robust image modification via natural language prompts, yet current methods face a precision-efficiency tradeoff. Fine-tuning methods demand significant computational resources and large datasets, while training-free techniques struggle with instruction comprehension and edit quality. We resolve this dilemma by leveraging large-scale Diffusion Transformer (DiT)'s enhanced generation capacity and native contextual awareness. Our solution introduces three contributions: (1) an in-context editing framework for zero-shot instruction compliance using in-context prompting, avoiding structural changes; (2) a LoRA-MoE hybrid tuning strategy that enhances flexibility with efficient adaptation and dynamic expert routing, without extensive retraining; and (3) an early filter inference-time scaling method using vision-language models (VLMs) to select better initial noise early, improving edit quality. Extensive evaluations demonstrate our method's superiority: it outperforms state-of-the-art approaches while requiring only 0.5% training data and 1% trainable parameters compared to conventional baselines. This work establishes a new paradigm that enables high-precision yet efficient instruction-guided editing. Codes and demos can be found in https://river-zhang.github.io/ICEdit-gh-pages/.
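
To make the "LoRA-MoE hybrid" in contribution (2) concrete, the following is a generic sketch of a linear layer with a frozen base weight, several low-rank (LoRA) experts, and a router that picks the top-k experts per input. The abstract does not specify the routing rule, expert count, rank, or where such layers sit in the DiT, so all of those choices, and the names `lora_moe_forward`, `matvec`, and `softmax`, are assumptions for illustration rather than the authors' implementation.

```python
import math
from typing import List, Tuple

Vector = List[float]
Matrix = List[List[float]]


def matvec(m: Matrix, v: Vector) -> Vector:
    """Multiply matrix m (rows x cols) by vector v (cols)."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def softmax(xs: Vector) -> Vector:
    peak = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def lora_moe_forward(
    x: Vector,
    base_w: Matrix,
    experts: List[Tuple[Matrix, Matrix]],
    router_w: Matrix,
    top_k: int = 1,
) -> Vector:
    """Compute y = W x + sum over the top-k experts of g_i * B_i (A_i x).

    base_w is the frozen weight W; each expert is a pair (A_i, B_i) of
    low-rank factors; router_w maps x to one logit per expert.
    """
    y = matvec(base_w, x)
    logits = matvec(router_w, x)
    ranked = sorted(range(len(experts)), key=lambda i: logits[i], reverse=True)
    chosen = ranked[:top_k]
    gates = softmax([logits[i] for i in chosen])  # renormalize over chosen
    for gate, idx in zip(gates, chosen):
        a, b = experts[idx]
        delta = matvec(b, matvec(a, x))  # low-rank update B (A x)
        y = [yi + gate * di for yi, di in zip(y, delta)]
    return y


# Tiny example: 2-dimensional input, two rank-1 experts, top-1 routing.
x = [1.0, 0.0]
base_w = [[1.0, 0.0], [0.0, 1.0]]
experts = [
    ([[1.0, 0.0]], [[0.0], [1.0]]),  # expert 0: A is 1x2, B is 2x1
    ([[0.0, 1.0]], [[1.0], [0.0]]),  # expert 1
]
router_w = [[1.0, 0.0], [0.0, 1.0]]
y = lora_moe_forward(x, base_w, experts, router_w, top_k=1)
print(y)  # [1.0, 1.0]
```

In a setup like this only the low-rank factors and the router would be trained, which is one way a small trainable-parameter budget such as the 1% figure in the abstract could arise; the sketch does not attempt to reproduce that figure.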

arXiv information

Authors	Zechuan Zhang,Ji Xie,Yu Lu,Zongxin Yang,Yi Yang
Published	2025-04-29 12:14:47+00:00
arXiv site	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google
