Pursuing Temporal-Consistent Video Virtual Try-On via Dynamic Pose Interaction

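Summary

Video virtual try-on aims to seamlessly dress a subject in a video with a specified garment.
The main challenge is to preserve the garment's visual authenticity while dynamically adapting to the subject's pose and physique.
Existing methods focus mainly on image-based virtual try-on, and extending them directly to video often produces temporal inconsistencies.
Most current video virtual try-on approaches mitigate this by adding temporal modules, but they still overlook the critical spatiotemporal pose interactions between human and garment.
Effective pose interaction in video must consider the spatial alignment between human and garment poses within each frame and also account for the temporal dynamics of human poses across the whole video.
Motivated by this, we propose a new framework, Dynamic Pose Interaction Diffusion Models (DPIDM), which uses diffusion models to explore dynamic pose interactions for video virtual try-on.
Technically, DPIDM introduces a skeleton-based pose adapter that integrates synchronized human and garment poses into the denoising network.
A hierarchical attention module then models intra-frame human-garment pose interactions and long-term human pose dynamics across frames through pose-aware spatial and temporal attention mechanisms.
In addition, DPIDM uses a temporal regularized attention loss between consecutive frames to improve temporal consistency.
Extensive experiments on the VITON-HD, VVT, and ViViD datasets show that DPIDM outperforms the baseline methods.
Notably, DPIDM achieves a VFID score of 0.506 on the VVT dataset, a 60.5% improvement over the state-of-the-art GPD-VVTO approach.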
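The summary names a temporal regularized attention loss between consecutive frames but does not give its form. As a rough illustration only, the sketch below assumes one plausible reading: compute a scaled dot-product attention map per frame and penalize the mean squared difference between the maps of neighboring frames. The function names (`attention_map`, `temporal_attention_loss`) and the squared-difference penalty are hypothetical stand-ins, not the loss defined in the paper.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention_map(queries, keys):
    """Scaled dot-product attention weights, one row per query vector."""
    scale = 1.0 / math.sqrt(len(keys[0]))
    return [
        softmax([scale * sum(q_i * k_i for q_i, k_i in zip(q, k)) for k in keys])
        for q in queries
    ]


def temporal_attention_loss(frame_maps):
    """Mean squared difference between attention maps of consecutive frames.

    ASSUMPTION: this penalty is illustrative; the paper's actual loss
    may be defined differently.
    """
    if len(frame_maps) < 2:
        return 0.0  # a single frame has no temporal neighbor
    total, count = 0.0, 0
    for prev, curr in zip(frame_maps, frame_maps[1:]):
        for row_p, row_c in zip(prev, curr):
            for a, b in zip(row_p, row_c):
                total += (a - b) ** 2
                count += 1
    return total / count


# Two toy "frames": the query changes between them, the keys stay fixed.
keys = [[1.0, 0.0], [0.0, 1.0]]
frame_a = attention_map([[1.0, 0.0]], keys)
frame_b = attention_map([[0.0, 1.0]], keys)
loss = temporal_attention_loss([frame_a, frame_b])
```

Under this reading the loss is zero when attention is identical across frames and grows as the maps drift apart, which is the kind of behavior a temporal-consistency regularizer is meant to encourage.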

Summary (original)

Video virtual try-on aims to seamlessly dress a subject in a video with a specific garment. The primary challenge involves preserving the visual authenticity of the garment while dynamically adapting to the pose and physique of the subject. While existing methods have predominantly focused on image-based virtual try-on, extending these techniques directly to videos often results in temporal inconsistencies. Most current video virtual try-on approaches alleviate this challenge by incorporating temporal modules, yet still overlook the critical spatiotemporal pose interactions between human and garment. Effective pose interactions in videos should not only consider spatial alignment between human and garment poses in each frame but also account for the temporal dynamics of human poses throughout the entire video. With such motivation, we propose a new framework, namely Dynamic Pose Interaction Diffusion Models (DPIDM), to leverage diffusion models to delve into dynamic pose interactions for video virtual try-on. Technically, DPIDM introduces a skeleton-based pose adapter to integrate synchronized human and garment poses into the denoising network. A hierarchical attention module is then exquisitely designed to model intra-frame human-garment pose interactions and long-term human pose dynamics across frames through pose-aware spatial and temporal attention mechanisms. Moreover, DPIDM capitalizes on a temporal regularized attention loss between consecutive frames to enhance temporal consistency. Extensive experiments conducted on VITON-HD, VVT and ViViD datasets demonstrate the superiority of our DPIDM against the baseline methods. Notably, DPIDM achieves VFID score of 0.506 on VVT dataset, leading to 60.5% improvement over the state-of-the-art GPD-VVTO approach.
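To make the closing figures concrete, the sketch below works through the arithmetic under one assumption: that the 60.5% improvement is a relative reduction in VFID, a metric where lower is better. The abstract does not state the baseline score, so the value computed here is only what that reading implies, not a reported result. Both helper names are hypothetical.

```python
def relative_improvement(baseline, new):
    """Fractional reduction of a lower-is-better metric."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (baseline - new) / baseline


def implied_baseline(new, improvement):
    """Baseline score implied by a new score and a relative improvement."""
    if not 0 <= improvement < 1:
        raise ValueError("improvement must be in [0, 1)")
    return new / (1.0 - improvement)


# ASSUMPTION: 60.5% is read as (baseline - new) / baseline.
baseline_vfid = implied_baseline(0.506, 0.605)  # roughly 1.28
check = relative_improvement(baseline_vfid, 0.506)  # recovers about 0.605
```

If the paper defines the improvement differently, the implied baseline would differ as well.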

arXiv information

Authors	Dong Li, Wenqi Zhong, Wei Yu, Yingwei Pan, Dingwen Zhang, Ting Yao, Junwei Han, Tao Mei
Published	2025-05-22 17:52:34+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
