Re^2TAL: Rewiring Pretrained Video Backbones for Reversible Temporal Action Localization

Summary

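Temporal action localization (TAL) requires long-form reasoning to predict actions of varying duration and complex content.
With limited GPU memory, training TAL end to end (that is, from videos to predictions) on long videos is a significant challenge.
Most methods can only train on pre-extracted features, without optimizing those features for the localization problem, which limits localization performance.
To extend the potential of TAL networks, this work proposes Re2TAL, a novel end-to-end method that rewires pretrained video backbones for reversible TAL.
Re2TAL builds its backbone from reversible modules, in which the input can be recovered from the output, so the bulky intermediate activations can be cleared from memory during training.
Instead of designing a single type of reversible module, the authors propose a network rewiring mechanism that transforms any module with a residual connection into a reversible module without changing any parameters.
This provides two benefits: (1) a large variety of reversible networks can easily be obtained from existing and even future model designs, and (2) the reversible models require much less training effort because they reuse the pretrained parameters of their original non-reversible versions.
Using only the RGB modality, Re2TAL reaches 37.01% average mAP on ActivityNet-v1.3, a new state-of-the-art record, and 64.9% mAP at tIoU=0.5 on THUMOS-14, outperforming all other RGB-only methods.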
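
To make the idea of a reversible module concrete, here is a minimal sketch of how an input can be recovered from an output, which is what allows intermediate activations to be discarded during training. It uses the well-known additive coupling from RevNet-style reversible networks as a stand-in. It is not necessarily the exact rewiring used in Re2TAL, whose details the summary does not give. The names `make_branch` and `ReversibleBlock` and the toy `tanh` branches are hypothetical and stand in for pretrained residual branches.

```python
import math


def make_branch(weight, bias):
    """Hypothetical stand-in for a pretrained residual branch F(x)."""
    def branch(x):
        return [math.tanh(weight * v + bias) for v in x]
    return branch


class ReversibleBlock:
    """Additive coupling (RevNet-style), used here only as an illustration.

    forward:  y1 = x1 + F(x2),  y2 = x2 + G(y1)
    inverse:  x2 = y2 - G(y1),  x1 = y1 - F(x2)
    F and G are reused unchanged; only the wiring differs from y = x + F(x).
    """

    def __init__(self, f, g):
        self.f = f
        self.g = g

    def forward(self, x1, x2):
        y1 = [a + b for a, b in zip(x1, self.f(x2))]
        y2 = [a + b for a, b in zip(x2, self.g(y1))]
        return y1, y2

    def inverse(self, y1, y2):
        # Recompute the inputs from the outputs instead of storing them.
        x2 = [a - b for a, b in zip(y2, self.g(y1))]
        x1 = [a - b for a, b in zip(y1, self.f(x2))]
        return x1, x2


block = ReversibleBlock(make_branch(0.5, 0.1), make_branch(-0.3, 0.2))
x1, x2 = [0.5, -1.0, 2.0], [1.5, 0.25, -0.75]
y1, y2 = block.forward(x1, x2)
r1, r2 = block.inverse(y1, y2)

# The recovered inputs match the originals up to floating-point error.
assert all(math.isclose(a, b, abs_tol=1e-9) for a, b in zip(r1, x1))
assert all(math.isclose(a, b, abs_tol=1e-9) for a, b in zip(r2, x2))
```

Because `inverse` reconstructs the inputs exactly (up to rounding), a training loop built on such blocks could in principle recompute activations during the backward pass rather than keep them in memory, at the cost of extra computation.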

Summary (original)

Temporal action localization (TAL) requires long-form reasoning to predict actions of various durations and complex content. Given limited GPU memory, training TAL end to end (i.e., from videos to predictions) on long videos is a significant challenge. Most methods can only train on pre-extracted features without optimizing them for the localization problem, consequently limiting localization performance. In this work, to extend the potential in TAL networks, we propose a novel end-to-end method Re2TAL, which rewires pretrained video backbones for reversible TAL. Re2TAL builds a backbone with reversible modules, where the input can be recovered from the output such that the bulky intermediate activations can be cleared from memory during training. Instead of designing one single type of reversible module, we propose a network rewiring mechanism, to transform any module with a residual connection to a reversible module without changing any parameters. This provides two benefits: (1) a large variety of reversible networks are easily obtained from existing and even future model designs, and (2) the reversible models require much less training effort as they reuse the pre-trained parameters of their original non-reversible versions. Re2TAL, only using the RGB modality, reaches 37.01% average mAP on ActivityNet-v1.3, a new state-of-the-art record, and mAP 64.9% at tIoU=0.5 on THUMOS-14, outperforming all other RGB-only methods.
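
The THUMOS-14 result above is reported at tIoU=0.5, where tIoU is the temporal intersection over union between a predicted segment and a ground-truth segment. The sketch below shows how that overlap is typically computed. The function name `temporal_iou` and the `(start, end)` tuple format in seconds are assumptions for illustration, and official benchmark evaluation code may differ in details.

```python
def temporal_iou(pred, gt):
    """Temporal IoU of two segments given as (start, end), with end >= start."""
    # Overlap length is zero when the segments do not intersect.
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0


# A prediction of 10s-20s against a ground truth of 12s-22s overlaps for 8s
# out of a 12s union, so it counts as a match at a threshold of 0.5.
assert temporal_iou((10.0, 20.0), (12.0, 22.0)) >= 0.5
```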

arXiv information

Authors	Chen Zhao, Shuming Liu, Karttikeya Mangalam, Bernard Ghanem
Published	2023-03-28 08:48:52+00:00
arXiv site	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google
