Implicit Temporal Modeling with Learnable Alignment for Video Recognition

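Summary

Title: Implicit Temporal Modeling with Learnable Alignment for Video Recognition

Summary:
– CLIP (contrastive language-image pretraining) has shown remarkable success on a variety of image tasks, but how to extend it with effective temporal modeling remains an open and important problem.
– Existing factorized or joint spatial-temporal modeling approaches trade off efficiency against performance.
– Modeling temporal information within a straight-through tube is widely adopted in the literature, but the authors find that simple frame alignment already captures the essentials without temporal attention.
– The paper therefore proposes Implicit Learnable Alignment (ILA), a method that minimizes the temporal modeling effort while achieving high performance.
– For a pair of frames, an interactive point is predicted in each frame; it serves as a region rich in mutual information.
– By enhancing the features around the interactive points, the two frames are implicitly aligned.
– The aligned features are pooled into a single token, which is then used in the subsequent spatial self-attention.
– This makes it possible to eliminate costly or insufficient temporal self-attention.
– Extensive experiments on benchmarks demonstrate the superiority and generality of the module.
– In particular, ILA reaches a top-1 accuracy of 88.7% on Kinetics-400 with far fewer FLOPs than Swin-L and ViViT-H.
– Code is available at https://github.com/Francis-Rings/ILA.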

Abstract (original)

Contrastive language-image pretraining (CLIP) has demonstrated remarkable success in various image tasks. However, how to extend CLIP with effective temporal modeling is still an open and crucial problem. Existing factorized or joint spatial-temporal modeling trades off between the efficiency and performance. While modeling temporal information within straight through tube is widely adopted in literature, we find that simple frame alignment already provides enough essence without temporal attention. To this end, in this paper, we proposed a novel Implicit Learnable Alignment (ILA) method, which minimizes the temporal modeling effort while achieving incredibly high performance. Specifically, for a frame pair, an interactive point is predicted in each frame, serving as a mutual information rich region. By enhancing the features around the interactive point, two frames are implicitly aligned. The aligned features are then pooled into a single token, which is leveraged in the subsequent spatial self-attention. Our method allows eliminating the costly or insufficient temporal self-attention in video. Extensive experiments on benchmarks demonstrate the superiority and generality of our module. Particularly, the proposed ILA achieves a top-1 accuracy of 88.7% on Kinetics-400 with much fewer FLOPs compared with Swin-L and ViViT-H. Code is released at https://github.com/Francis-Rings/ILA .
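
The pipeline described in the abstract (predict an interactive point per frame, enhance the features around it, pool the aligned features into one token, feed that token to spatial self-attention) can be illustrated with a small NumPy sketch. This is not the authors' implementation: the abstract does not say how points are predicted or how features are enhanced, so the linear point predictor, the Gaussian mask, the mean pooling, and all names (`predict_point`, `point_mask`, `alignment_token`, `spatial_self_attention`) are assumptions made only for illustration.

```python
import numpy as np

def predict_point(feat_a, feat_b, weight):
    """Hypothetical predictor: linear map on pooled pair features -> (y, x) in [0, 1]."""
    pooled = np.concatenate([feat_a.mean(axis=(0, 1)), feat_b.mean(axis=(0, 1))])  # (2C,)
    return 1.0 / (1.0 + np.exp(-(weight @ pooled)))  # sigmoid, shape (2,)

def point_mask(point, height, width, sigma=0.25):
    """Soft mask (assumed Gaussian) that peaks at `point` and decays with distance."""
    ys = (np.arange(height) + 0.5) / height
    xs = (np.arange(width) + 0.5) / width
    dist_sq = (ys[:, None] - point[0]) ** 2 + (xs[None, :] - point[1]) ** 2
    return np.exp(-dist_sq / (2 * sigma ** 2))  # (H, W)

def alignment_token(feat_a, feat_b, weight_a, weight_b):
    """Enhance features around each frame's point, then pool both frames into one token."""
    height, width, _ = feat_a.shape
    mask_a = point_mask(predict_point(feat_a, feat_b, weight_a), height, width)
    mask_b = point_mask(predict_point(feat_a, feat_b, weight_b), height, width)
    enhanced_a = feat_a * mask_a[..., None]
    enhanced_b = feat_b * mask_b[..., None]
    return 0.5 * (enhanced_a + enhanced_b).mean(axis=(0, 1))  # (C,)

def spatial_self_attention(tokens):
    """Single-head attention with identity projections, for illustration only."""
    scores = tokens @ tokens.T / np.sqrt(tokens.shape[1])
    scores = scores - scores.max(axis=1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=1, keepdims=True)
    return weights @ tokens

rng = np.random.default_rng(0)
H, W, C = 4, 4, 8
frame_prev = rng.normal(size=(H, W, C))  # stand-ins for per-frame patch features
frame_cur = rng.normal(size=(H, W, C))
w_prev = 0.1 * rng.normal(size=(2, 2 * C))  # random stand-ins for learned weights
w_cur = 0.1 * rng.normal(size=(2, 2 * C))

token = alignment_token(frame_prev, frame_cur, w_prev, w_cur)
# The pooled token joins the current frame's patch tokens; no temporal attention is used.
tokens = np.concatenate([token[None, :], frame_cur.reshape(H * W, C)], axis=0)
out = spatial_self_attention(tokens)
```

In this sketch the only cross-frame computation is the single pooled token, so the attention cost stays that of one frame's spatial attention plus one extra token, which is the efficiency argument the abstract makes against temporal self-attention.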

arXiv information

Authors	Shuyuan Tu, Qi Dai, Zuxuan Wu, Zhi-Qi Cheng, Han Hu, Yu-Gang Jiang
Published	2023-04-20 17:11:01+00:00
arXiv page	arxiv_id(pdf)

Sources and services used

arxiv.jp, OpenAI
