Towards Efficient Diffusion-Based Image Editing with Instant Attention Masks

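Abstract

Diffusion-based image editing (DIE) is an emerging research hotspot that often applies a semantic mask to control the target region of a diffusion-based edit.
However, most existing solutions obtain these masks through manual operations or offline processing, which greatly reduces their efficiency.
This paper proposes a novel and efficient image editing method for Text-to-Image (T2I) diffusion models, called Instant Diffusion Editing (InstDiffEdit).
Specifically, InstDiffEdit aims to use the cross-modal attention ability of existing diffusion models to provide instant mask guidance during the diffusion steps.
To reduce the noise in the attention maps and achieve full automation, InstDiffEdit is equipped with a training-free refinement scheme that adaptively aggregates the attention distributions for automatic yet accurate mask generation.
In addition, to supplement existing evaluations of DIE, the paper proposes a new benchmark called Editing-Mask, which examines the mask accuracy and local editing ability of existing methods.
To validate InstDiffEdit, extensive experiments were conducted on ImageNet and Imagen, comparing it against a range of SOTA methods.
The experimental results show that InstDiffEdit not only outperforms the SOTA methods in both image quality and editing results, but also has a much faster inference speed (5 to 6 times faster).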

Abstract (original)

Diffusion-based Image Editing (DIE) is an emerging research hot-spot, which often applies a semantic mask to control the target area for diffusion-based editing. However, most existing solutions obtain these masks via manual operations or off-line processing, greatly reducing their efficiency. In this paper, we propose a novel and efficient image editing method for Text-to-Image (T2I) diffusion models, termed Instant Diffusion Editing(InstDiffEdit). In particular, InstDiffEdit aims to employ the cross-modal attention ability of existing diffusion models to achieve instant mask guidance during the diffusion steps. To reduce the noise of attention maps and realize the full automatics, we equip InstDiffEdit with a training-free refinement scheme to adaptively aggregate the attention distributions for the automatic yet accurate mask generation. Meanwhile, to supplement the existing evaluations of DIE, we propose a new benchmark called Editing-Mask to examine the mask accuracy and local editing ability of existing methods. To validate InstDiffEdit, we also conduct extensive experiments on ImageNet and Imagen, and compare it with a bunch of the SOTA methods. The experimental results show that InstDiffEdit not only outperforms the SOTA methods in both image quality and editing results, but also has a much faster inference speed, i.e., +5 to +6 times.
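
The abstract does not spell out how the attention maps become masks, so the following is only a minimal sketch of generic mask-guided editing, not the InstDiffEdit refinement scheme itself. It assumes a single attention map for the edited word, min-max normalization, and a fixed cutoff; the names `normalize`, `attention_to_mask`, `blend`, and the `threshold` parameter are hypothetical and chosen purely for illustration.

```python
from typing import List

Grid = List[List[float]]


def normalize(attn: Grid) -> Grid:
    """Min-max scale an attention map to the range [0, 1]."""
    flat = [v for row in attn for v in row]
    lo, hi = min(flat), max(flat)
    span = hi - lo
    if span == 0:
        # A constant map carries no localization signal; return all zeros.
        return [[0.0 for _ in row] for row in attn]
    return [[(v - lo) / span for v in row] for row in attn]


def attention_to_mask(attn: Grid, threshold: float = 0.5) -> Grid:
    """Binarize a normalized attention map (threshold is an assumed knob)."""
    return [[1.0 if v >= threshold else 0.0 for v in row]
            for row in normalize(attn)]


def blend(source: Grid, edited: Grid, mask: Grid) -> Grid:
    """Keep edited values inside the mask and source values outside it."""
    return [
        [m * e + (1.0 - m) * s for s, e, m in zip(s_row, e_row, m_row)]
        for s_row, e_row, m_row in zip(source, edited, mask)
    ]


# Toy 2x3 attention map standing in for one word's cross-attention.
attn = [[0.1, 0.8, 0.9], [0.0, 0.2, 0.7]]
mask = attention_to_mask(attn)

# Toy "latents": the source is all 1.0, the edited version is all 5.0.
source = [[1.0, 1.0, 1.0], [1.0, 1.0, 1.0]]
edited = [[5.0, 5.0, 5.0], [5.0, 5.0, 5.0]]
result = blend(source, edited, mask)
```

In this toy setup only the high-attention cells take the edited values, while the rest of the source is preserved. In a real diffusion pipeline a blend of this kind would typically be applied to the latents at each denoising step, but that detail is an assumption here rather than something stated in the abstract.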

arXiv information

Authors	Siyu Zou, Jiji Tang, Yiyi Zhou, Jing He, Chaoyi Zhao, Rongsheng Zhang, Zhipeng Hu, Xiaoshuai Sun
Published	2024-01-23 11:22:03+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
