Real-time Online Video Detection with Temporal Smoothing Transformers

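Summary

Streaming video recognition reasons about the objects in a video and their actions at every frame.
A good streaming recognition model captures both the long-term dynamics and the short-term changes of a video.
Unfortunately, in most existing methods the computational complexity grows linearly or quadratically with the length of the dynamics considered.
This problem is especially pronounced in transformer-based architectures.
To address it, we reformulate the cross-attention of a video transformer through the lens of kernels and apply one of two temporal smoothing kernels: a box kernel or a Laplace kernel.
The resulting streaming attention reuses much of the computation from frame to frame and needs only a constant-time update per frame.
Building on this idea, we construct TeSTra, a Temporal Smoothing Transformer that takes in arbitrarily long inputs with constant caching and computing overhead.
Specifically, in a streaming setting with 2,048 frames, it runs $6\times$ faster than an equivalent sliding-window-based transformer.
In addition, thanks to the longer temporal span, TeSTra achieves state-of-the-art results on THUMOS’14 and EPIC-Kitchen-100, two standard datasets for online action detection and action anticipation.
A real-time version of TeSTra outperforms all but one prior approach on the THUMOS’14 dataset.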

Summary (original)

Streaming video recognition reasons about objects and their actions in every frame of a video. A good streaming recognition model captures both long-term dynamics and short-term changes of video. Unfortunately, in most existing methods, the computational complexity grows linearly or quadratically with the length of the considered dynamics. This issue is particularly pronounced in transformer-based architectures. To address this issue, we reformulate the cross-attention in a video transformer through the lens of kernel and apply two kinds of temporal smoothing kernel: A box kernel or a Laplace kernel. The resulting streaming attention reuses much of the computation from frame to frame, and only requires a constant time update each frame. Based on this idea, we build TeSTra, a Temporal Smoothing Transformer, that takes in arbitrarily long inputs with constant caching and computing overhead. Specifically, it runs $6\times$ faster than equivalent sliding-window based transformers with 2,048 frames in a streaming setting. Furthermore, thanks to the increased temporal span, TeSTra achieves state-of-the-art results on THUMOS’14 and EPIC-Kitchen-100, two standard online action detection and action anticipation datasets. A real-time version of TeSTra outperforms all but one prior approaches on the THUMOS’14 dataset.
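The constant-time update mentioned in the abstract can be illustrated with a small sketch. It is not the authors' implementation: it assumes a single fixed query, an unnormalized attention weight of exp(q·k), and a Laplace-style kernel that decays each past frame by exp(-decay) per step. The names `StreamingLaplaceAttention`, `full_recompute`, and `decay` are hypothetical and chosen only for this illustration. Under those assumptions, two running sums are enough to reproduce what a full recomputation over the whole history would give.

```python
import math


def dot(a, b):
    """Dot product of two equal-length lists of floats."""
    return sum(x * y for x, y in zip(a, b))


class StreamingLaplaceAttention:
    """Single-query attention over an exponentially decayed history.

    Hypothetical sketch: keeps a running numerator and denominator so
    each new frame costs O(dim) regardless of how many frames came before.
    """

    def __init__(self, query, dim_value, decay):
        self.query = query
        self.gamma = math.exp(-decay)   # per-frame decay factor
        self.num = [0.0] * dim_value    # decayed sum of weight * value
        self.den = 0.0                  # decayed sum of weights

    def update(self, key, value):
        # Weight of the newest frame (no numerical stabilization here).
        w = math.exp(dot(self.query, key))
        # Decay the cached sums once, then add the new frame's term.
        self.num = [self.gamma * n + w * v for n, v in zip(self.num, value)]
        self.den = self.gamma * self.den + w
        return [n / self.den for n in self.num]


def full_recompute(query, keys, values, decay):
    """Reference: recompute the decayed attention over all frames so far."""
    t = len(keys) - 1
    weights = [math.exp(dot(query, k) - decay * (t - s))
               for s, k in enumerate(keys)]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / total
            for d in range(dim)]
```

Feeding frames one at a time through `update` should match `full_recompute` on the same prefix up to floating-point error, while storing only `num` and `den` instead of the full history. A box kernel would instead need the frames inside the window so the oldest term can be subtracted when it leaves.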

arXiv information

Authors	Yue Zhao, Philipp Krähenbühl
Published	2022-09-19 17:59:02+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
