Spatial Parsing and Dynamic Temporal Pooling networks for Human-Object Interaction detection

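Abstract

The key to Human-Object Interaction (HOI) recognition is inferring the relationship between humans and objects. HOI detection in images has made significant progress in recent years, but video HOI detection still has room for improvement. Existing one-stage methods use carefully designed end-to-end networks to detect a video segment and directly predict an interaction, which makes model training and further optimization of the network more complex. This paper introduces the Spatial Parsing and Dynamic Temporal Pooling (SPDTP) network, which takes the entire video as input in the form of a spatio-temporal graph with human and object nodes. Unlike existing methods, the proposed network uses explicit spatial parsing to predict the difference between interactive and non-interactive pairs, and then performs interaction recognition. The paper also proposes a learnable and differentiable Dynamic Temporal Module (DTM) that emphasizes the keyframes of a video and suppresses redundant frames. Experimental results show that SPDTP pays more attention to active human-object pairs and valid keyframes. Overall, the method achieves state-of-the-art performance on the CAD-120 and Something-Else datasets.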

Abstract (original)

The key of Human-Object Interaction(HOI) recognition is to infer the relationship between human and objects. Recently, the image’s Human-Object Interaction(HOI) detection has made significant progress. However, there is still room for improvement in video HOI detection performance. Existing one-stage methods use well-designed end-to-end networks to detect a video segment and directly predict an interaction. It makes the model learning and further optimization of the network more complex. This paper introduces the Spatial Parsing and Dynamic Temporal Pooling (SPDTP) network, which takes the entire video as a spatio-temporal graph with human and object nodes as input. Unlike existing methods, our proposed network predicts the difference between interactive and non-interactive pairs through explicit spatial parsing, and then performs interaction recognition. Moreover, we propose a learnable and differentiable Dynamic Temporal Module(DTM) to emphasize the keyframes of the video and suppress the redundant frame. Furthermore, the experimental results show that SPDTP can pay more attention to active human-object pairs and valid keyframes. Overall, we achieve state-of-the-art performance on CAD-120 dataset and Something-Else dataset.
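The abstract describes the Dynamic Temporal Module only at a high level: it is learnable, differentiable, emphasizes keyframes, and suppresses redundant frames. As a rough illustration of that general idea, and not the authors' DTM, the sketch below pools per-frame features with softmax weights. The function name `soft_temporal_pool`, the `temperature` parameter, and the assumption that each frame already has a scalar keyframe score are all hypothetical; in a real model the scores would come from a learned scorer.

```python
import numpy as np

def soft_temporal_pool(frame_features, frame_scores, temperature=1.0):
    """Pool (T, D) per-frame features into one (D,) video-level vector.

    Illustrative sketch only, NOT the DTM from the paper.
    frame_scores is a hypothetical (T,) array of keyframe logits.
    """
    frame_features = np.asarray(frame_features, dtype=float)
    scores = np.asarray(frame_scores, dtype=float) / temperature
    scores = scores - scores.max()      # shift for numerical stability
    weights = np.exp(scores)
    weights /= weights.sum()            # softmax over the time axis
    pooled = weights @ frame_features   # weighted sum of frame features
    return pooled, weights
```

With equal scores this reduces to plain average pooling; as one frame's score grows, the pooled vector approaches that frame's features, so low-scoring (redundant) frames contribute little. Because every step is a smooth function of the scores, gradients can flow back to whatever produces them.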

arXiv information

Authors	Hongsheng Li,Guangming Zhu,Wu Zhen,Lan Ni,Peiyi Shen,Liang Zhang,Ning Wang,Cong Hua
Published	2022-06-07 07:26:06+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, DeepL
