Task-adaptive Spatial-Temporal Video Sampler for Few-shot Action Recognition

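Summary

The main challenge in few-shot action recognition is the lack of sufficient video data for training. Current methods in this area focus mainly on devising algorithms at the feature level and pay little attention to how the input video data is processed. Moreover, existing frame sampling strategies can omit critical action information in the temporal and spatial dimensions, which further reduces video utilization efficiency. To address this, the paper proposes a new video frame sampler for few-shot action recognition that performs task-specific spatial-temporal frame sampling with a temporal selector (TS) and a spatial amplifier (SA). The sampler first scans the whole video at a small computational cost to obtain a global perception of the frames. The TS then selects the top-T frames that contribute most significantly. The SA amplifies critical regions under the guidance of saliency maps, emphasizing the discriminative information in each frame. Task-adaptive learning is further adopted to adjust the sampling strategy dynamically according to the episode task at hand. Because the implementations of both TS and SA are differentiable, the sampler can be optimized end to end and integrated seamlessly with most few-shot action recognition methods. Extensive experiments show a significant performance boost on various benchmarks, including long-term videos.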

Abstract (original)

A primary challenge faced in few-shot action recognition is inadequate video data for training. To address this issue, current methods in this field mainly focus on devising algorithms at the feature level while little attention is paid to processing input video data. Moreover, existing frame sampling strategies may omit critical action information in temporal and spatial dimensions, which further impacts video utilization efficiency. In this paper, we propose a novel video frame sampler for few-shot action recognition to address this issue, where task-specific spatial-temporal frame sampling is achieved via a temporal selector (TS) and a spatial amplifier (SA). Specifically, our sampler first scans the whole video at a small computational cost to obtain a global perception of video frames. The TS plays its role in selecting top-T frames that contribute most significantly and subsequently. The SA emphasizes the discriminative information of each frame by amplifying critical regions with the guidance of saliency maps. We further adopt task-adaptive learning to dynamically adjust the sampling strategy according to the episode task at hand. Both the implementations of TS and SA are differentiable for end-to-end optimization, facilitating seamless integration of our proposed sampler with most few-shot action recognition methods. Extensive experiments show a significant boost in the performances on various benchmarks including long-term videos.
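The two sampling steps described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the abstract states that TS and SA are differentiable, whereas the sketch below uses a hard (non-differentiable) top-T selection and a simple inverse-CDF resampling as stand-ins. The function names, the per-frame `scores`, and the `saliency` array are all hypothetical inputs assumed for illustration.

```python
import numpy as np


def select_top_t_frames(scores, t):
    """Return the indices of the t highest-scoring frames, in temporal order.

    Hard top-T selection; an assumed stand-in for the paper's differentiable TS.
    """
    scores = np.asarray(scores, dtype=float)
    if not 0 < t <= scores.size:
        raise ValueError("t must be between 1 and the number of frames")
    top = np.argsort(scores, kind="stable")[-t:]  # t largest scores
    return np.sort(top)                           # restore temporal order


def amplify_axis(saliency_1d, out_size):
    """Sampling coordinates along one axis, denser where saliency is high.

    Inverse-CDF sampling: equal steps in cumulative saliency map to
    small steps in position inside salient regions.
    """
    s = np.asarray(saliency_1d, dtype=float) + 1e-6  # avoid a flat CDF
    cdf = np.cumsum(s) / s.sum()
    targets = (np.arange(out_size) + 0.5) / out_size
    return np.interp(targets, cdf, np.arange(s.size))


def amplify_frame(frame, saliency, out_hw):
    """Resample a 2-D frame so that salient rows and columns are enlarged.

    Nearest-neighbour lookup; an assumed stand-in for the paper's SA.
    """
    rows = np.rint(amplify_axis(saliency.sum(axis=1), out_hw[0])).astype(int)
    cols = np.rint(amplify_axis(saliency.sum(axis=0), out_hw[1])).astype(int)
    return frame[rows][:, cols]


# Hypothetical per-frame importance scores for a 5-frame video.
print(select_top_t_frames([0.1, 0.9, 0.3, 0.8, 0.2], t=2))  # prints [1 3]

# Hypothetical 4x4 frame whose centre 2x2 region is salient.
frame = np.arange(16).reshape(4, 4)
saliency = np.zeros((4, 4))
saliency[1:3, 1:3] = 1.0
print(amplify_frame(frame, saliency, (4, 4)))
```

In this sketch the selected frames keep their temporal order, and the salient centre rows and columns occupy more of the output grid than the borders do. A trainable sampler would replace both hard operations with differentiable relaxations, which the abstract mentions but does not specify.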

arXiv information

Authors	Huabin Liu, Weixian Lv, John See, Weiyao Lin
Published	2022-08-03 09:55:23+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, DeepL
