MADiff: Text-Guided Fashion Image Editing with Mask Prediction and Attention-Enhanced Diffusion

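Summary

Text-guided image editing models have achieved great success in general domains.
However, applying these models directly to the fashion domain can run into two problems: (1) inaccurate localization of the editing region;
(2) weak editing magnitude.
The MADiff model is proposed to address these problems.
To localize the editing region more accurately, MaskNet feeds the foreground region, densepose, and mask prompts from a large language model into a lightweight UNet, which predicts the mask of the editing region.
To strengthen the editing magnitude, the Attention-Enhanced Diffusion Model feeds the noise map, the attention map, and the mask from MaskNet into the proposed Attention Processor, which produces a refined noise map.
Integrating the refined noise map into the diffusion model lets the edited image align better with the target prompt.
Because no benchmark exists for fashion image editing, the authors built a dataset named Fashion-E, with 28390 image-text pairs in the training set and 2639 image-text pairs covering four types of fashion tasks in the evaluation set.
Extensive experiments on Fashion-E show that the proposed method accurately predicts the editing-region mask and significantly increases editing magnitude in fashion image editing compared with state-of-the-art methods.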
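
The abstract does not give the formula used by the Attention Processor, so the following is only a minimal sketch of the data flow it describes: a noise map, an attention map, and an editing-region mask go in, and a refined noise map comes out. The function name `refine_noise_map`, the `gain` parameter, and the scaling rule itself are assumptions made for illustration and are not taken from the paper.

```python
from typing import List

Grid = List[List[float]]


def refine_noise_map(noise: Grid, attention: Grid, mask: Grid, gain: float = 0.5) -> Grid:
    """Hypothetical refinement rule, NOT the paper's Attention Processor.

    Inside the mask, each noise value is scaled by 1 + gain * attention.
    Where the mask is 0, the noise value is returned unchanged.
    """
    if not (len(noise) == len(attention) == len(mask)):
        raise ValueError("all maps must have the same height")
    refined: Grid = []
    for n_row, a_row, m_row in zip(noise, attention, mask):
        if not (len(n_row) == len(a_row) == len(m_row)):
            raise ValueError("all maps must have the same width")
        refined.append(
            [n * (1.0 + gain * a * m) for n, a, m in zip(n_row, a_row, m_row)]
        )
    return refined


# Toy 2x2 maps; real maps would have the latent resolution of the diffusion model.
noise = [[1.0, -2.0], [0.5, 4.0]]
attention = [[1.0, 0.5], [0.0, 1.0]]
mask = [[1.0, 1.0], [1.0, 0.0]]  # bottom-right pixel lies outside the editing region
print(refine_noise_map(noise, attention, mask, gain=0.5))
# prints [[1.5, -2.5], [0.5, 4.0]]
```

In this toy rule the mask confines the change to the editing region, and the attention map decides how strongly each masked position is amplified. The actual model may combine the three inputs quite differently.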

Abstract (original)

Text-guided image editing model has achieved great success in general domain. However, directly applying these models to the fashion domain may encounter two issues: (1) Inaccurate localization of editing region; (2) Weak editing magnitude. To address these issues, the MADiff model is proposed. Specifically, to more accurately identify editing region, the MaskNet is proposed, in which the foreground region, densepose and mask prompts from large language model are fed into a lightweight UNet to predict the mask for editing region. To strengthen the editing magnitude, the Attention-Enhanced Diffusion Model is proposed, where the noise map, attention map, and the mask from MaskNet are fed into the proposed Attention Processor to produce a refined noise map. By integrating the refined noise map into the diffusion model, the edited image can better align with the target prompt. Given the absence of benchmarks in fashion image editing, we constructed a dataset named Fashion-E, comprising 28390 image-text pairs in the training set, and 2639 image-text pairs for four types of fashion tasks in the evaluation set. Extensive experiments on Fashion-E demonstrate that our proposed method can accurately predict the mask of editing region and significantly enhance editing magnitude in fashion image editing compared to the state-of-the-art methods.

arXiv information

Authors	Zechao Zhan, Dehong Gao, Jinxia Zhang, Jiale Huang, Yang Hu, Xin Wang
Published	2025-01-15 15:53:13+00:00

Source and services used

arxiv.jp, Google
