Human-Object Interaction Prediction in Videos through Gaze Following

Summary

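Understanding human-object interactions (HOIs) in video is essential to fully comprehending a visual scene.
Prior work has addressed this problem by detecting HOIs in images and, more recently, in videos.
However, video-based HOI anticipation from a third-person view remains understudied.
In this paper, we design a framework that detects current HOIs and anticipates future HOIs in videos.
We propose leveraging human gaze information, because people often fixate on an object before interacting with it.
These gaze features are fused with the scene context and the visual appearance of human-object pairs through a spatio-temporal transformer.
To evaluate the model on HOI anticipation in multi-person scenarios, we propose a set of person-wise multi-label metrics.
Our model is trained and validated on the VidHOI dataset, which contains videos of daily life and is currently the largest video HOI dataset.
Experimental results on the HOI detection task show that our approach outperforms the baseline by a relative margin of 36.3%.
In addition, we conduct an extensive ablation study to demonstrate the effectiveness of our modifications and extensions to the spatio-temporal transformer.
Our code is publicly available at https://github.com/nizhf/hoi-prediction-gaze-transformer.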
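
To make the idea of person-wise multi-label evaluation more concrete, here is a minimal sketch. It is not the paper's definition: the summary does not spell out the metrics, so the function name `person_wise_metrics`, the "interaction:object" label strings, and the choice of precision, recall, and F1 averaged over persons are all assumptions made for illustration. The point is only that each person can have several HOI labels at once, and that scores are computed per person before averaging.

```python
from typing import Dict, Set, Tuple


def person_wise_metrics(
    predicted: Dict[str, Set[str]],
    ground_truth: Dict[str, Set[str]],
) -> Tuple[float, float, float]:
    """Return (precision, recall, F1), each averaged over persons.

    Both arguments map a hypothetical person ID to the set of
    "interaction:object" labels assigned to that person. This is an
    illustrative definition, not the one used in the paper.
    """
    precisions, recalls, f1_scores = [], [], []
    for person, truth in ground_truth.items():
        pred = predicted.get(person, set())
        hits = len(pred & truth)  # labels predicted correctly for this person
        precision = hits / len(pred) if pred else 0.0
        recall = hits / len(truth) if truth else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom > 0 else 0.0
        precisions.append(precision)
        recalls.append(recall)
        f1_scores.append(f1)

    n = len(ground_truth)
    if n == 0:
        return 0.0, 0.0, 0.0
    return sum(precisions) / n, sum(recalls) / n, sum(f1_scores) / n


# Two people in the same frame, each with their own label set.
truth = {"p1": {"hold:cup", "watch:tv"}, "p2": {"ride:bicycle"}}
pred = {"p1": {"hold:cup"}, "p2": {"ride:bicycle", "hold:cup"}}
precision, recall, f1 = person_wise_metrics(pred, truth)
print(precision, recall, f1)
```

Averaging per person rather than pooling all labels keeps one person with many interactions from dominating the score in a multi-person scene.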

Summary (original)

Understanding the human-object interactions (HOIs) from a video is essential to fully comprehend a visual scene. This line of research has been addressed by detecting HOIs from images and lately from videos. However, the video-based HOI anticipation task in the third-person view remains understudied. In this paper, we design a framework to detect current HOIs and anticipate future HOIs in videos. We propose to leverage human gaze information since people often fixate on an object before interacting with it. These gaze features together with the scene contexts and the visual appearances of human-object pairs are fused through a spatio-temporal transformer. To evaluate the model in the HOI anticipation task in a multi-person scenario, we propose a set of person-wise multi-label metrics. Our model is trained and validated on the VidHOI dataset, which contains videos capturing daily life and is currently the largest video HOI dataset. Experimental results in the HOI detection task show that our approach improves the baseline by a great margin of 36.3% relatively. Moreover, we conduct an extensive ablation study to demonstrate the effectiveness of our modifications and extensions to the spatio-temporal transformer. Our code is publicly available on https://github.com/nizhf/hoi-prediction-gaze-transformer.

arXiv information

Authors	Zhifan Ni, Esteve Valls Mascaró, Hyemin Ahn, Dongheui Lee
Published	2023-06-06 11:36:14+00:00
arXiv site	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google
