Interaction Region Visual Transformer for Egocentric Action Anticipation

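Summary

Human-object interaction is one of the most important visual cues, and we propose a new way to represent human-object interactions for egocentric action anticipation.
We model interactions by computing how the appearance of objects and human hands changes as an action is executed, and we propose a new transformer variant that uses those changes to refine the video representation.
Specifically, we model interactions between hands and objects with Spatial Cross-Attention (SCA), then inject contextual information with Trajectory Cross-Attention to obtain environment-refined interaction tokens.
From these tokens we build an interaction-centric video representation for anticipating actions.
We call the model InAViT; it achieves state-of-the-art action anticipation performance on the large-scale egocentric datasets EPIC-KITCHENS-100 (EK100) and EGTEA Gaze+.
InAViT outperforms other visual-transformer-based methods, including those that use an object-centric video representation.
On the EK100 evaluation server, InAViT is the top-performing method on the public leaderboard (at the time of submission), outperforming the second-best model by 3.3% on mean top-5 recall.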
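
To make the cross-attention step concrete, here is a minimal sketch of scaled dot-product cross-attention in plain Python. It is not the authors' implementation: the function name `cross_attention`, the choice of hand tokens as queries and object tokens as keys and values, and the omission of learned projection matrices are all assumptions made for illustration.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Each query token attends over all key/value tokens.

    Hypothetical usage: queries are hand tokens, keys/values are object
    tokens. Learned Q/K/V projections are left out (an assumption), so
    this only shows the attention arithmetic itself.
    """
    dim = len(keys[0])
    refined = []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(dim).
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in keys]
        weights = softmax(scores)
        # Weighted sum of the value vectors.
        refined.append(
            [sum(w * v[j] for w, v in zip(weights, values)) for j in range(len(values[0]))]
        )
    return refined


# Toy example: one hand token attending over two object tokens.
hand_tokens = [[1.0, 0.0]]
object_tokens = [[1.0, 0.0], [0.0, 1.0]]
interaction_tokens = cross_attention(hand_tokens, object_tokens, object_tokens)
```

In this toy case the hand token is more similar to the first object token, so the resulting interaction token leans toward that object's value vector.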

Abstract (original)

Human-object interaction is one of the most important visual cues and we propose a novel way to represent human-object interactions for egocentric action anticipation. We propose a novel transformer variant to model interactions by computing the change in the appearance of objects and human hands due to the execution of the actions and use those changes to refine the video representation. Specifically, we model interactions between hands and objects using Spatial Cross-Attention (SCA) and further infuse contextual information using Trajectory Cross-Attention to obtain environment-refined interaction tokens. Using these tokens, we construct an interaction-centric video representation for action anticipation. We term our model InAViT, which achieves state-of-the-art action anticipation performance on large-scale egocentric datasets EPIC-KITCHENS-100 (EK100) and EGTEA Gaze+. InAViT outperforms other visual transformer-based methods including object-centric video representation. On the EK100 evaluation server, InAViT is the top-performing method on the public leaderboard (at the time of submission) where it outperforms the second-best model by 3.3% on mean-top5 recall.
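
The reported metric, mean top-5 recall, can be sketched as follows, assuming it is a class-mean recall: for each ground-truth class, the fraction of its samples whose true label appears among the k highest-scoring predictions, averaged over classes. The function name `mean_top_k_recall` and the toy data are hypothetical, and the official EK100 evaluation may differ in details such as which classes are included.

```python
from collections import defaultdict


def mean_top_k_recall(scores, labels, k=5):
    """Class-mean top-k recall (a sketch, not the official evaluation code).

    scores: list of per-sample score lists, one score per class.
    labels: list of ground-truth class indices, one per sample.
    """
    hits = defaultdict(int)
    counts = defaultdict(int)
    for row, label in zip(scores, labels):
        # Indices of the k highest-scoring classes for this sample.
        top_k = sorted(range(len(row)), key=lambda c: row[c], reverse=True)[:k]
        counts[label] += 1
        hits[label] += int(label in top_k)
    # Average per-class recall so rare classes weigh as much as frequent ones.
    return sum(hits[c] / counts[c] for c in counts) / len(counts)


# Toy example with three classes and k=1 so the effect is visible.
toy_scores = [[0.9, 0.1, 0.0], [0.8, 0.2, 0.0], [0.1, 0.7, 0.2]]
toy_labels = [0, 1, 1]
toy_recall = mean_top_k_recall(toy_scores, toy_labels, k=1)
```

Averaging per class rather than per sample keeps frequent actions from dominating the score, which matters for long-tailed label distributions.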

arXiv information

Authors	Debaditya Roy, Ramanathan Rajendiran, Basura Fernando
Published	2024-01-11 15:11:41+00:00
arXiv site	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google
