PAT: Position-Aware Transformer for Dense Multi-Label Action Detection

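Abstract

We present PAT, a transformer-based network that exploits multi-scale temporal features to learn the dependencies among complex, temporally co-occurring actions in a video.
In existing methods, the self-attention mechanism of the transformer loses temporal positional information, which is essential for robust action detection.
To address this, we (i) embed relative positional encoding in the self-attention mechanism and (ii) exploit multi-scale temporal relationships by designing a novel non-hierarchical network, in contrast to recent transformer-based approaches that use a hierarchical structure.
We argue that combining the self-attention mechanism with the multiple sub-sampling steps of hierarchical approaches increases the loss of positional information.
We evaluate the proposed approach on two challenging dense multi-label benchmark datasets and show that PAT improves on the current state-of-the-art result by 1.1% and 0.6% mAP on the Charades and MultiTHUMOS datasets, reaching new state-of-the-art mAP of 26.5% and 44.6%, respectively.
We also perform extensive ablation studies to examine the impact of the different components of the proposed network.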

Abstract (original)

We present PAT, a transformer-based network that learns complex temporal co-occurrence action dependencies in a video by exploiting multi-scale temporal features. In existing methods, the self-attention mechanism in transformers loses the temporal positional information, which is essential for robust action detection. To address this issue, we (i) embed relative positional encoding in the self-attention mechanism and (ii) exploit multi-scale temporal relationships by designing a novel non-hierarchical network, in contrast to the recent transformer-based approaches that use a hierarchical structure. We argue that joining the self-attention mechanism with multiple sub-sampling processes in the hierarchical approaches results in increased loss of positional information. We evaluate the performance of our proposed approach on two challenging dense multi-label benchmark datasets, and show that PAT improves the current state-of-the-art result by 1.1% and 0.6% mAP on the Charades and MultiTHUMOS datasets, respectively, thereby achieving the new state-of-the-art mAP at 26.5% and 44.6%, respectively. We also perform extensive ablation studies to examine the impact of the different components of our proposed network.
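
The abstract does not give PAT's exact formulation, so the following is only a minimal sketch of the general idea of adding a relative positional term to self-attention. It assumes a single attention head, a learned scalar bias per clipped temporal offset, and NumPy; the names `relative_self_attention`, `rel_bias`, and `max_dist` are hypothetical and are not taken from the paper.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def relative_self_attention(x, w_q, w_k, w_v, rel_bias, max_dist):
    """Single-head self-attention over T time steps with a relative-position bias.

    x:        (T, d) temporal features
    w_q/k/v:  (d, d) projection matrices
    rel_bias: (2 * max_dist + 1,) one scalar per clipped offset j - i
              (an illustrative assumption, not the paper's parameterization)
    """
    T, d = x.shape
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = (q @ k.T) / np.sqrt(d)              # (T, T) content-based term
    idx = np.arange(T)
    # Offset j - i for every (query i, key j) pair, clipped and shifted to index rel_bias.
    offsets = np.clip(idx[None, :] - idx[:, None], -max_dist, max_dist) + max_dist
    scores = scores + rel_bias[offsets]          # position-dependent term
    weights = softmax(scores, axis=-1)
    return weights @ v, weights

rng = np.random.default_rng(0)
T, d, max_dist = 6, 4, 3
x = rng.normal(size=(T, d))
w_q, w_k, w_v = (rng.normal(size=(d, d)) for _ in range(3))
rel_bias = rng.normal(size=2 * max_dist + 1)
out, weights = relative_self_attention(x, w_q, w_k, w_v, rel_bias, max_dist)
```

With `rel_bias` set to zero, the layer is permutation-equivariant: shuffling the time steps of `x` merely shuffles the rows of the output, which is the sense in which plain self-attention discards temporal order. The bias term breaks that symmetry because it depends on the offset between the query and key positions.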

arXiv information

Authors	Faegheh Sardari,Armin Mustafa,Philip J. B. Jackson,Adrian Hilton
Published	2023-08-09 16:29:31+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
