How can objects help action recognition?

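Summary

Current state-of-the-art video models process a video clip as a long sequence of spatio-temporal tokens.
However, they do not explicitly model objects or their interactions across the video, and instead process every token in the video.
This paper investigates how knowledge of objects can be used to design better video models, namely models that process fewer tokens and improve recognition accuracy.
This contrasts with prior work, which either drops tokens at the cost of accuracy or raises accuracy while also increasing the computation required.
First, the authors propose an object-guided token sampling strategy that retains only a small fraction of the input tokens with minimal impact on accuracy.
Second, they propose an object-aware attention module that enriches the feature representation with object information and improves overall accuracy.
The resulting framework performs better than strong baselines while using fewer tokens.
In particular, it matches the baseline using 30%, 40%, and 60% of the input tokens on SomethingElse, Something-something v2, and Epic-Kitchens, respectively.
When the model processes the same number of tokens as the baseline, it improves by 0.6 to 4.2 points on these datasets.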
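
To make the idea of object-guided token sampling concrete, the following is a minimal sketch and not the authors' method. It assumes a single frame split into a regular grid of square patch tokens, object boxes supplied by some external detector, and a simple scoring rule (the fraction of each patch covered by a box). The names `patch_object_scores` and `sample_tokens`, the scoring rule, and the tie-breaking by index are all hypothetical choices made for illustration.

```python
from typing import List, Tuple

# Assumed box format: (x0, y0, x1, y1) in pixel coordinates.
Box = Tuple[float, float, float, float]


def patch_object_scores(boxes: List[Box], grid_h: int, grid_w: int,
                        patch: int) -> List[float]:
    """Score each patch token by the fraction of its area covered by a box.

    Tokens are listed in row-major order. When several boxes overlap a
    patch, the largest single-box coverage is used.
    """
    scores = []
    for row in range(grid_h):
        for col in range(grid_w):
            px0, py0 = col * patch, row * patch
            px1, py1 = px0 + patch, py0 + patch
            best = 0.0
            for x0, y0, x1, y1 in boxes:
                # Width and height of the patch/box intersection (0 if disjoint).
                iw = max(0.0, min(px1, x1) - max(px0, x0))
                ih = max(0.0, min(py1, y1) - max(py0, y0))
                best = max(best, iw * ih / (patch * patch))
            scores.append(best)
    return scores


def sample_tokens(scores: List[float], keep_fraction: float) -> List[int]:
    """Keep the highest-scoring fraction of tokens; return their indices sorted."""
    n_keep = max(1, round(keep_fraction * len(scores)))
    # Sort by descending score, breaking ties by token index.
    order = sorted(range(len(scores)), key=lambda i: (-scores[i], i))
    return sorted(order[:n_keep])


# A 4x4 grid of 16-pixel patches with one object box in the centre.
scores = patch_object_scores([(16, 16, 48, 48)], grid_h=4, grid_w=4, patch=16)
kept = sample_tokens(scores, keep_fraction=0.25)
print(kept)  # [5, 6, 9, 10]
```

Here only the four central patches overlap the box, so keeping 25% of the 16 tokens retains exactly those four. A video model would apply the same idea across frames and would still need some background tokens for context; how the paper balances these is not stated in the summary.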

Summary (original)

Current state-of-the-art video models process a video clip as a long sequence of spatio-temporal tokens. However, they do not explicitly model objects, their interactions across the video, and instead process all the tokens in the video. In this paper, we investigate how we can use knowledge of objects to design better video models, namely to process fewer tokens and to improve recognition accuracy. This is in contrast to prior works which either drop tokens at the cost of accuracy, or increase accuracy whilst also increasing the computation required. First, we propose an object-guided token sampling strategy that enables us to retain a small fraction of the input tokens with minimal impact on accuracy. And second, we propose an object-aware attention module that enriches our feature representation with object information and improves overall accuracy. Our resulting framework achieves better performance when using fewer tokens than strong baselines. In particular, we match our baseline with 30%, 40%, and 60% of the input tokens on SomethingElse, Something-something v2, and Epic-Kitchens, respectively. When we use our model to process the same number of tokens as our baseline, we improve by 0.6 to 4.2 points on these datasets.

arXiv information

Authors	Xingyi Zhou, Anurag Arnab, Chen Sun, Cordelia Schmid
Published	2023-06-20 17:56:16+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
