Enhanced Multi-Scale Cross-Attention for Person Image Generation

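Abstract

This paper proposes a new cross-attention-based generative adversarial network (GAN) for the challenging task of person image generation.
Cross-attention is a novel, intuitive multi-modal fusion method that computes an attention (correlation) matrix between two feature maps of different modalities.
Specifically, we propose XingGAN (or CrossingGAN), which consists of two generation branches that capture a person's appearance and shape, respectively.
We also propose two new cross-attention blocks that transfer and update the person's shape and appearance embeddings so that each improves the other.
No other existing GAN-based image generation work has considered this.
To further learn long-range correlations between different person poses at different scales and sub-regions, we propose two new multi-scale cross-attention blocks.
The cross-attention mechanism computes correlations independently, which yields noisy, ambiguous attention weights that hinder performance gains; to address this, we propose a module called enhanced attention (EA).
Finally, we introduce a new densely connected co-attention module that effectively fuses appearance and shape features at different stages.
Extensive experiments on two public datasets show that the proposed method outperforms current GAN-based methods and performs on par with diffusion-based methods.
It is, however, significantly faster than diffusion-based methods in both training and inference.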

Abstract (original)

In this paper, we propose a novel cross-attention-based generative adversarial network (GAN) for the challenging person image generation task. Cross-attention is a novel and intuitive multi-modal fusion method in which an attention/correlation matrix is calculated between two feature maps of different modalities. Specifically, we propose the novel XingGAN (or CrossingGAN), which consists of two generation branches that capture the person’s appearance and shape, respectively. Moreover, we propose two novel cross-attention blocks to effectively transfer and update the person’s shape and appearance embeddings for mutual improvement. This has not been considered by any other existing GAN-based image generation work. To further learn the long-range correlations between different person poses at different scales and sub-regions, we propose two novel multi-scale cross-attention blocks. To tackle the issue of independent correlation computations within the cross-attention mechanism leading to noisy and ambiguous attention weights, which hinder performance improvements, we propose a module called enhanced attention (EA). Lastly, we introduce a novel densely connected co-attention module to fuse appearance and shape features at different stages effectively. Extensive experiments on two public datasets demonstrate that the proposed method outperforms current GAN-based methods and performs on par with diffusion-based methods. However, our method is significantly faster than diffusion-based methods in both training and inference.
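
The abstract describes cross-attention as computing an attention/correlation matrix between two feature maps of different modalities. The sketch below illustrates that general idea in NumPy for an appearance feature map and a shape feature map. It is not the XingGAN block itself: the function name `cross_attention`, the scaling by the square root of the channel count, the row-wise softmax, and the residual update are all assumptions made for illustration, since the abstract does not specify these details.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_attention(appearance, shape):
    """Illustrative cross-attention between two feature maps (hypothetical helper).

    appearance, shape: arrays of shape (C, H, W) with matching dimensions.
    Returns the updated appearance map (C, H, W) and the attention
    matrix (H*W, H*W).
    """
    c, h, w = appearance.shape
    a = appearance.reshape(c, h * w)  # (C, N), N = H*W spatial positions
    s = shape.reshape(c, h * w)       # (C, N)

    # Correlation matrix: corr[i, j] compares appearance position i
    # with shape position j. The sqrt(C) scaling is an assumption.
    corr = a.T @ s / np.sqrt(c)       # (N, N)

    # Normalize each row so the weights over shape positions sum to 1.
    attn = softmax(corr, axis=-1)     # (N, N)

    # Each appearance position aggregates shape features weighted by its
    # attention row; the residual addition is an assumption.
    updated = a + s @ attn.T          # (C, N)
    return updated.reshape(c, h, w), attn
```

Swapping the two arguments gives the reverse direction, in which the shape features are updated from the appearance features. That is one way to read the "mutual improvement" of the two embeddings mentioned in the abstract.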

arXiv information

Authors	Hao Tang, Ling Shao, Nicu Sebe, Luc Van Gool
Published	2025-01-15 16:08:25+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
