Social-MAE: Social Masked Autoencoder for Multi-person Motion Representation Learning




A complete understanding of multi-person scenes requires going beyond basic tasks such as detection and tracking. Higher-level tasks, such as understanding the interactions and social activities among individuals, are also crucial. Progress towards models that can fully understand scenes involving multiple people is hindered by a lack of sufficient annotated data for such high-level tasks. To address this challenge, we introduce Social-MAE, a simple yet effective transformer-based masked autoencoder framework for multi-person human motion data. The framework uses masked modeling to pre-train the encoder to reconstruct masked human joint trajectories, enabling it to learn generalizable and data-efficient representations of motion in crowded human scenes. Social-MAE comprises a transformer as the MAE encoder and a lighter-weight transformer as the MAE decoder, which operates on multi-person joint trajectories in the frequency domain. After the reconstruction task, the MAE decoder is replaced with a task-specific decoder and the model is fine-tuned end-to-end for a variety of high-level social tasks. Our proposed model, combined with our pre-training approach, achieves state-of-the-art results on various high-level social tasks, including multi-person pose forecasting, social grouping, and social action understanding. These improvements are demonstrated across four popular multi-person datasets encompassing both 2D and 3D human body poses.
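
The masked pre-training objective described above can be illustrated with a small sketch. This is not the authors' implementation: the abstract only says the model operates on joint trajectories "in the frequency domain", so the use of an orthonormal DCT-II, the choice of one token per (person, joint) trajectory, the 50% mask ratio, and every function name below are assumptions made for illustration. The transformer encoder and decoder are replaced by a trivial stand-in that predicts zeros, so only the data flow (transform, mask, score masked tokens) is shown.

```python
import math
import random

def dct_ii(signal):
    """Orthonormal DCT-II of a 1-D sequence (assumed frequency transform)."""
    n = len(signal)
    out = []
    for k in range(n):
        s = sum(x * math.cos(math.pi * (t + 0.5) * k / n)
                for t, x in enumerate(signal))
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        out.append(scale * s)
    return out

def make_tokens(scene):
    """scene[p][j] is the trajectory over time of one coordinate of joint j
    of person p. Each (person, joint) trajectory becomes one token holding
    its DCT coefficients (hypothetical tokenization)."""
    return {(p, j): dct_ii(traj)
            for p, person in enumerate(scene)
            for j, traj in enumerate(person)}

def random_mask(keys, mask_ratio, rng):
    """Split token keys into visible and masked sets, chosen at random."""
    keys = sorted(keys)
    n_masked = int(round(mask_ratio * len(keys)))
    masked = set(rng.sample(keys, n_masked))
    visible = [k for k in keys if k not in masked]
    return visible, sorted(masked)

def masked_mse(pred, target, masked):
    """Mean squared error computed over the masked tokens only."""
    errs = [(a - b) ** 2
            for k in masked
            for a, b in zip(pred[k], target[k])]
    return sum(errs) / len(errs)

# Toy scene: 2 people, 3 joints each, 8 time steps of synthetic motion.
rng = random.Random(0)
scene = [[[math.sin(0.3 * t + p + j) for t in range(8)]
          for j in range(3)]
         for p in range(2)]
tokens = make_tokens(scene)
visible, masked = random_mask(tokens.keys(), 0.5, rng)

# Stand-in for encoder + decoder: predict all-zero coefficients.
pred = {k: [0.0] * 8 for k in masked}
loss = masked_mse(pred, tokens, masked)
```

In an actual model, the visible tokens would be fed to the encoder and a decoder would predict the coefficients of the masked tokens; the loss above would then drive pre-training before the decoder is swapped for a task-specific one.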


Authors: Mahsa Ehsanpour, Ian Reid, Hamid Rezatofighi
Published: 2024-04-08 14:54:54+00:00
