Rethinking CLIP-based Video Learners in Cross-Domain Open-Vocabulary Action Recognition

Abstract

Building upon the impressive success of CLIP (Contrastive Language-Image Pretraining), recent pioneering works have proposed adapting the powerful CLIP to video data, leading to efficient and effective video learners for open-vocabulary action recognition. Inspired by the fact that humans perform actions in diverse environments, our work delves into an intriguing question: can CLIP-based video learners effectively generalize to video domains they have not encountered during training? To answer this, we establish a CROSS-domain Open-Vocabulary Action recognition benchmark named XOV-Action, and conduct a comprehensive evaluation of five state-of-the-art CLIP-based video learners under various types of domain gaps. The evaluation demonstrates that previous methods exhibit limited action recognition performance in unseen video domains, revealing potential challenges of the cross-domain open-vocabulary action recognition task. In this paper, we focus on one critical challenge of the task, namely scene bias, and accordingly contribute a novel scene-aware video-text alignment method. Our key idea is to distinguish video representations from scene-encoded text representations, aiming to learn scene-agnostic video representations for recognizing actions across domains. Extensive experiments demonstrate the effectiveness of our method. The benchmark and code will be available at https://github.com/KunyuLin/XOV-Action/.
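
The abstract does not state the actual training objective, so the following is only a rough sketch of how the key idea could be expressed as a loss. It assumes a CLIP-style softmax alignment between video embeddings and action-name text embeddings, plus a penalty on positive cosine similarity between video embeddings and scene-description text embeddings. The function name `scene_aware_loss`, the parameters `temperature` and `scene_weight`, and the clipped-cosine penalty are all assumptions made for illustration, not the authors' formulation.

```python
import math


def normalize(vec):
    """Scale a vector to unit L2 norm (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def cosine(u, v):
    """Cosine similarity between two vectors."""
    return sum(a * b for a, b in zip(normalize(u), normalize(v)))


def scene_aware_loss(videos, action_texts, scene_texts, labels,
                     temperature=0.07, scene_weight=0.5):
    """Hypothetical loss: align videos with action text, repel scene text.

    videos:       list of video embedding vectors
    action_texts: list of text embeddings, one per action class
    scene_texts:  list of text embeddings describing scenes (assumed input)
    labels:       index of the correct action class for each video
    Returns (total, alignment_term, scene_term).
    """
    align_total = 0.0
    scene_total = 0.0
    for video, label in zip(videos, labels):
        # CLIP-style alignment: cross-entropy over temperature-scaled cosines.
        logits = [cosine(video, t) / temperature for t in action_texts]
        peak = max(logits)  # subtract the max for numerical stability
        log_norm = peak + math.log(sum(math.exp(z - peak) for z in logits))
        align_total += log_norm - logits[label]

        # Scene term (assumption): penalize only positive similarity to
        # scene text, so the video embedding is pushed away from scene cues.
        sims = [max(cosine(video, s), 0.0) for s in scene_texts]
        scene_total += sum(sims) / len(sims)

    count = len(videos)
    align = align_total / count
    scene = scene_total / count
    return align + scene_weight * scene, align, scene


if __name__ == "__main__":
    videos = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
    actions = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
    scenes = [[1.0, 0.0, 0.0]]  # overlaps with the first video embedding
    print(scene_aware_loss(videos, actions, scenes, labels=[0, 1]))
```

In this toy setup the alignment term depends only on the action text, while the scene term grows when a video embedding points in the same direction as a scene embedding, which is the behavior a scene-agnostic representation would try to avoid.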

arXiv information

Authors	Kun-Yu Lin, Henghui Ding, Jiaming Zhou, Yu-Ming Tang, Yi-Xing Peng, Zhilin Zhao, Chen Change Loy, Wei-Shi Zheng
Published	2024-05-24 14:47:03+00:00