SITAR: Semi-supervised Image Transformer for Action Recognition

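Abstract

Recognizing actions from a limited set of labeled videos remains a challenge, because annotating visual data is not only tedious but can also be expensive when the data is classified.
Moreover, processing spatio-temporal data with deep 3D transformers for this task can add significant computational complexity.
In this paper, we address video action recognition in a semi-supervised setting, using only a handful of labeled videos together with a collection of unlabeled videos in a compute-efficient manner.
Specifically, we rearrange multiple frames of each input video in row-column form to construct a super image.
We then draw on the large pool of unlabeled samples and apply contrastive learning to the encoded super images.
Our approach uses two pathways to generate representations of temporally augmented super images that originate from the same video.
A 2D image transformer produces the representations, and a contrastive loss minimizes the similarity between representations of different videos while maximizing the similarity between representations of the same video.
Our method outperforms existing state-of-the-art approaches to semi-supervised action recognition across several benchmark datasets while significantly reducing computational cost.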
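
The row-column rearrangement of frames into a super image can be illustrated with a minimal, dependency-free sketch. The function name `make_super_image`, the near-square default grid, and the zero-padding of unused grid cells are assumptions made for illustration; the abstract does not specify the grid layout or how leftover cells are handled. Frames are plain nested lists here, whereas a real pipeline would likely use a tensor library.

```python
import math


def make_super_image(frames, cols=None):
    """Tile equally sized frames row by row into one 2D grid (a "super image").

    frames: list of T frames, each a list of H rows with W pixel values.
    cols:   number of frames per grid row; defaults to ceil(sqrt(T)),
            which is an assumption, not a detail taken from the paper.
    """
    if not frames:
        raise ValueError("need at least one frame")
    h, w = len(frames[0]), len(frames[0][0])
    if any(len(f) != h or any(len(row) != w for row in f) for f in frames):
        raise ValueError("all frames must have the same height and width")

    t = len(frames)
    if cols is None:
        cols = math.ceil(math.sqrt(t))
    rows = math.ceil(t / cols)

    # Assumption: unused grid cells are filled with blank (all-zero) frames.
    blank = [[0] * w for _ in range(h)]
    padded = list(frames) + [blank] * (rows * cols - t)

    super_image = []
    for r in range(rows):
        row_frames = padded[r * cols:(r + 1) * cols]
        for y in range(h):
            # Concatenate scanline y of every frame in this grid row.
            line = []
            for frame in row_frames:
                line.extend(frame[y])
            super_image.append(line)
    return super_image
```

The result is an ordinary 2D image of size (rows x H) by (cols x W), so a 2D image model can consume it directly, with temporal order encoded in the grid position of each frame.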

Abstract (original)

Recognizing actions from a limited set of labeled videos remains a challenge as annotating visual data is not only tedious but also can be expensive due to classified nature. Moreover, handling spatio-temporal data using deep $3$D transformers for this can introduce significant computational complexity. In this paper, our objective is to address video action recognition in a semi-supervised setting by leveraging only a handful of labeled videos along with a collection of unlabeled videos in a compute efficient manner. Specifically, we rearrange multiple frames from the input videos in row-column form to construct super images. Subsequently, we capitalize on the vast pool of unlabeled samples and employ contrastive learning on the encoded super images. Our proposed approach employs two pathways to generate representations for temporally augmented super images originating from the same video. Specifically, we utilize a 2D image-transformer to generate representations and apply a contrastive loss function to minimize the similarity between representations from different videos while maximizing the representations of identical videos. Our method demonstrates superior performance compared to existing state-of-the-art approaches for semi-supervised action recognition across various benchmark datasets, all while significantly reducing computational costs.
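
The contrastive objective described in the abstract can be sketched as follows. This is an InfoNCE-style loss over cosine similarities, chosen here as a common stand-in; the abstract does not state the exact loss, so the formulation, the function names, and the `temperature` default are all assumptions. Each `z_a[i]` and `z_b[i]` stands for the embeddings that the two pathways produce for two temporally augmented super images of video `i`.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("embeddings must be non-zero")
    return sum(a * b for a, b in zip(u, v)) / (norm_u * norm_v)


def contrastive_loss(z_a, z_b, temperature=0.5):
    """InfoNCE-style loss (an assumed stand-in for the paper's loss).

    z_a[i] and z_b[i] are embeddings of two views of the same video i
    (the positive pair); z_b[j] with j != i act as negatives for z_a[i].
    """
    if len(z_a) != len(z_b) or not z_a:
        raise ValueError("need two equally sized, non-empty batches")
    n = len(z_a)
    total = 0.0
    for i in range(n):
        logits = [cosine(z_a[i], z_b[j]) / temperature for j in range(n)]
        # Numerically stable log-sum-exp over all candidates.
        m = max(logits)
        log_denominator = m + math.log(sum(math.exp(x - m) for x in logits))
        # Negative log-probability assigned to the positive pair.
        total += log_denominator - logits[i]
    return total / n
```

Minimizing this quantity pulls the two views of the same video together and pushes views of different videos apart, which matches the behavior the abstract describes.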

arXiv information

Authors	Owais Iqbal, Omprakash Chakraborty, Aftab Hussain, Rameswar Panda, Abir Das
Published	2024-09-04 17:49:54+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
