Just Add $\pi$! Pose Induced Video Transformers for Understanding Activities of Daily Living

Summary

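Video transformers have become the de facto standard for human action recognition, but their exclusive reliance on the RGB modality still limits their adoption in certain domains.
One such domain is Activities of Daily Living (ADL), where RGB alone cannot adequately distinguish visually similar actions or actions observed from multiple viewpoints.
To promote the adoption of video transformers for ADL, we hypothesize that it is essential to augment RGB with human pose information, which is known for its sensitivity to fine-grained motion and to multiple viewpoints.
We therefore introduce PI-ViT (or $\pi$-ViT), the first Pose Induced Video Transformer: a novel approach that augments the RGB representations learned by video transformers with 2D and 3D pose information.
The key components of $\pi$-ViT are two plug-in modules, the 2D Skeleton Induction Module and the 3D Skeleton Induction Module, which induce 2D and 3D pose information into the RGB representations.
These modules work by performing pose-aware auxiliary tasks, a design choice that lets $\pi$-ViT discard them at inference.
Notably, $\pi$-ViT achieves state-of-the-art performance on three prominent ADL datasets, covering both real-world and large-scale RGB-D datasets, without requiring poses or additional computational overhead at inference.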

Summary (original)

Video transformers have become the de facto standard for human action recognition, yet their exclusive reliance on the RGB modality still limits their adoption in certain domains. One such domain is Activities of Daily Living (ADL), where RGB alone is not sufficient to distinguish between visually similar actions, or actions observed from multiple viewpoints. To facilitate the adoption of video transformers for ADL, we hypothesize that the augmentation of RGB with human pose information, known for its sensitivity to fine-grained motion and multiple viewpoints, is essential. Consequently, we introduce the first Pose Induced Video Transformer: PI-ViT (or $\pi$-ViT), a novel approach that augments the RGB representations learned by video transformers with 2D and 3D pose information. The key elements of $\pi$-ViT are two plug-in modules, 2D Skeleton Induction Module and 3D Skeleton Induction Module, that are responsible for inducing 2D and 3D pose information into the RGB representations. These modules operate by performing pose-aware auxiliary tasks, a design choice that allows $\pi$-ViT to discard the modules during inference. Notably, $\pi$-ViT achieves the state-of-the-art performance on three prominent ADL datasets, encompassing both real-world and large-scale RGB-D datasets, without requiring poses or additional computational overhead at inference.
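
The "train with auxiliary modules, discard them at inference" pattern described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the abstract does not say what the pose-aware auxiliary tasks are, so a simple pose-regression head is swapped in here purely for illustration. The class `ToyPoseInducedModel`, the helpers `linear` and `mse`, and the `aux_weight` parameter are all hypothetical names, and small weight matrices stand in for the video transformer.

```python
from dataclasses import dataclass
from typing import List, Optional

Vector = List[float]
Matrix = List[Vector]


def linear(weights: Matrix, x: Vector) -> Vector:
    """Multiply a weight matrix (a list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


def mse(pred: Vector, target: Vector) -> float:
    """Mean squared error between two equal-length vectors."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


@dataclass
class ToyPoseInducedModel:
    backbone: Matrix                    # stand-in for the RGB video transformer
    cls_head: Matrix                    # action classification head
    pose_head: Optional[Matrix] = None  # hypothetical training-only auxiliary head

    def features(self, rgb: Vector) -> Vector:
        return linear(self.backbone, rgb)

    def training_loss(self, rgb: Vector, label_onehot: Vector,
                      pose: Vector, aux_weight: float = 0.5) -> float:
        """Main-task loss plus, if the auxiliary head is attached, a pose loss."""
        feats = self.features(rgb)
        loss = mse(linear(self.cls_head, feats), label_onehot)
        if self.pose_head is not None:
            # Assumed auxiliary task: regress pose values from RGB features.
            loss += aux_weight * mse(linear(self.pose_head, feats), pose)
        return loss

    def strip_auxiliary(self) -> None:
        """Drop the auxiliary head; inference never depended on it."""
        self.pose_head = None

    def predict(self, rgb: Vector) -> int:
        """Inference uses only the backbone and classification head (no pose)."""
        scores = linear(self.cls_head, self.features(rgb))
        return max(range(len(scores)), key=scores.__getitem__)
```

In this sketch the pose input appears only in `training_loss`, so `predict` returns the same result before and after `strip_auxiliary()`, which mirrors the abstract's claim that no poses or extra computation are needed at inference.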

arXiv information

Authors	Dominick Reilly, Srijan Das
Published	2023-11-30 18:59:56+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
