Video Action Recognition with Attentive Semantic Units

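Summary

Visual-Language Models (VLMs) have significantly advanced video action recognition.
Recent works use the semantics of action labels as supervision and adapt the visual branch of a VLM to learn video representations.
Although these works have demonstrated their effectiveness, the authors argue that the potential of VLMs has not yet been fully exploited.
They therefore exploit the semantic units (SUs) hidden behind the action labels and use their correlations with fine-grained items in the frames to recognize actions more accurately.
SUs are entities extracted from the language descriptions of the entire action set, and include body parts, objects, scenes, and motions.
To further strengthen the alignment between visual content and the SUs, a multi-region module (MRA) is introduced into the visual branch of the VLM.
The MRA makes it possible to perceive region-aware visual features in addition to the original global feature.
The method adaptively attends to the SUs and selects the relevant ones using the visual features of the frames.
A cross-modal decoder then uses the selected SUs to decode spatiotemporal video representations.
In short, using SUs as an intermediate medium improves both discriminative ability and transferability.
In fully supervised learning, the method achieves 87.8% top-1 accuracy on Kinetics-400.
In K=2 few-shot experiments, it surpasses the previous state of the art by +7.1% on HMDB-51 and by +15.0% on UCF-101.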

Abstract (original)

Visual-Language Models (VLMs) have significantly advanced action video recognition. Supervised by the semantics of action labels, recent works adapt the visual branch of VLMs to learn video representations. Despite the effectiveness proved by these works, we believe that the potential of VLMs has yet to be fully harnessed. In light of this, we exploit the semantic units (SU) hiding behind the action labels and leverage their correlations with fine-grained items in frames for more accurate action recognition. SUs are entities extracted from the language descriptions of the entire action set, including body parts, objects, scenes, and motions. To further enhance the alignments between visual contents and the SUs, we introduce a multi-region module (MRA) to the visual branch of the VLM. The MRA allows the perception of region-aware visual features beyond the original global feature. Our method adaptively attends to and selects relevant SUs with visual features of frames. With a cross-modal decoder, the selected SUs serve to decode spatiotemporal video representations. In summary, the SUs as the medium can boost discriminative ability and transferability. Specifically, in fully-supervised learning, our method achieved 87.8% top-1 accuracy on Kinetics-400. In K=2 few-shot experiments, our method surpassed the previous state-of-the-art by +7.1% and +15.0% on HMDB-51 and UCF-101, respectively.
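
The step in which frame features "attend to and select relevant SUs" can be illustrated with a small sketch. This is not the authors' implementation: it assumes plain cosine-similarity attention with a softmax, and the function name `attend_semantic_units`, the temperature value 0.07, and the random toy vectors are all hypothetical choices made for illustration. In practice the frame features and SU embeddings would come from the visual and text branches of a VLM.

```python
import math
import random


def normalize(v):
    """Scale a vector to unit length (left unchanged if its norm is zero)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend_semantic_units(frame_feats, su_embeds, temperature=0.07):
    """Soft-select semantic units for each frame.

    frame_feats: list of T frame feature vectors (length D each).
    su_embeds:   list of N semantic-unit embedding vectors (length D each).
    Returns (weights, attended): T x N attention weights and
    T x D weighted sums of the SU embeddings.
    The temperature is an assumed, illustrative value.
    """
    su_unit = [normalize(s) for s in su_embeds]
    dim = len(su_embeds[0])
    all_weights, attended = [], []
    for frame in frame_feats:
        f = normalize(frame)
        # Cosine similarity between this frame and every semantic unit.
        scores = [sum(a * b for a, b in zip(f, s)) / temperature for s in su_unit]
        w = softmax(scores)
        all_weights.append(w)
        # Weighted combination of SU embeddings for this frame.
        attended.append(
            [sum(w[j] * su_embeds[j][d] for j in range(len(su_embeds))) for d in range(dim)]
        )
    return all_weights, attended


# Toy data standing in for real VLM features (assumption: 4 frames, 10 SUs, D=16).
random.seed(0)
frame_feats = [[random.gauss(0, 1) for _ in range(16)] for _ in range(4)]
su_embeds = [[random.gauss(0, 1) for _ in range(16)] for _ in range(10)]
weights, attended = attend_semantic_units(frame_feats, su_embeds)
print(len(weights), len(weights[0]), len(attended[0]))
```

Each row of `weights` sums to 1, so a frame's attended vector is a convex combination of SU embeddings. A decoder such as the cross-modal decoder mentioned in the abstract could then consume these per-frame vectors, but that part is not sketched here.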

arXiv information

Authors	Yifei Chen, Dapeng Chen, Ruijin Liu, Hao Li, Wei Peng
Published	2023-10-10 13:31:22+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
