Multi-Modal Few-Shot Temporal Action Detection

Summary

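Few-shot (FS) and zero-shot (ZS) learning are two different approaches to scaling temporal action detection (TAD) to new classes.
The former adapts a pretrained vision model to a new task represented by as few as one video per class, while the latter needs no training examples because it exploits a semantic description of the new class.
This work introduces a new multi-modality few-shot (MMFS) TAD problem, which can be viewed as a combination of FS-TAD and ZS-TAD that uses few-shot support videos and new class names jointly.
To tackle this problem, the authors further introduce a novel MUlti-modality PromPt mETa-learning (MUPPET) method.
The method works by efficiently bridging pretrained vision and language models while maximally reusing their already learned capacity.
Concretely, it constructs multi-modal prompts by mapping support videos into the textual token space of a vision-language model, using a meta-learned, adapter-equipped visual semantics tokenizer.
To handle large intra-class variation, a query feature regulation scheme is also designed.
Extensive experiments on ActivityNetv1.3 and THUMOS14 show that MUPPET outperforms state-of-the-art alternative methods, often by a large margin.
MUPPET can also be easily extended to the few-shot object detection problem, where it again achieves state-of-the-art performance on the MS-COCO dataset.
The code will be available at https://github.com/sauradip/MUPPET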
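
The prompt construction described above can be illustrated with a toy sketch. This is not the authors' implementation: the chunk-wise mean pooling, the single linear `adapter` matrix, the token ordering, and every name and dimension below are assumptions chosen only to show the general idea of projecting support-video features into a text-token space and joining them with class-name tokens. In MUPPET the tokenizer is meta-learned; here the weights are fixed numbers.

```python
def matvec(matrix, vector):
    """Multiply a (rows x cols) matrix by a length-cols vector."""
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]


def mean_pool(frames):
    """Average a non-empty list of equal-length frame features."""
    return [sum(column) / len(frames) for column in zip(*frames)]


def visual_tokens(frames, adapter, num_tokens):
    """Pool frames into num_tokens contiguous chunks and project each chunk
    into the (hypothetical) text-token embedding space."""
    total = len(frames)
    if not 1 <= num_tokens <= total:
        raise ValueError("num_tokens must be between 1 and the number of frames")
    # Integer chunk boundaries; every chunk is non-empty because num_tokens <= total.
    bounds = [i * total // num_tokens for i in range(num_tokens + 1)]
    return [matvec(adapter, mean_pool(frames[start:end]))
            for start, end in zip(bounds, bounds[1:])]


def build_prompt(vis_tokens, class_name_tokens):
    """Concatenate visual tokens and class-name tokens into one prompt.
    Placing the visual tokens first is an assumption of this sketch."""
    return vis_tokens + class_name_tokens


# Toy inputs (all hypothetical): 6 frames of 4-d visual features,
# a 3x4 adapter mapping into a 3-d text-token space,
# and 2 class-name token embeddings of dimension 3.
frames = [[float(t + d) for d in range(4)] for t in range(6)]
adapter = [[0.1 * (r + 1)] * 4 for r in range(3)]
class_name_tokens = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]

prompt = build_prompt(visual_tokens(frames, adapter, num_tokens=2), class_name_tokens)
print(len(prompt), len(prompt[0]))
```

The resulting prompt is a single token sequence in which the visual tokens share the dimensionality of the class-name tokens, which is what would let a text encoder consume both modalities together.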

Abstract (original)

Few-shot (FS) and zero-shot (ZS) learning are two different approaches for scaling temporal action detection (TAD) to new classes. The former adapts a pretrained vision model to a new task represented by as few as a single video per class, whilst the latter requires no training examples by exploiting a semantic description of the new class. In this work, we introduce a new multi-modality few-shot (MMFS) TAD problem, which can be considered as a marriage of FS-TAD and ZS-TAD by leveraging few-shot support videos and new class names jointly. To tackle this problem, we further introduce a novel MUlti-modality PromPt mETa-learning (MUPPET) method. This is enabled by efficiently bridging pretrained vision and language models whilst maximally reusing already learned capacity. Concretely, we construct multi-modal prompts by mapping support videos into the textual token space of a vision-language model using a meta-learned adapter-equipped visual semantics tokenizer. To tackle large intra-class variation, we further design a query feature regulation scheme. Extensive experiments on ActivityNetv1.3 and THUMOS14 demonstrate that our MUPPET outperforms state-of-the-art alternative methods, often by a large margin. We also show that our MUPPET can be easily extended to tackle the few-shot object detection problem and again achieves the state-of-the-art performance on MS-COCO dataset. The code will be available in https://github.com/sauradip/MUPPET

arXiv information

Authors	Sauradip Nag, Mengmeng Xu, Xiatian Zhu, Juan-Manuel Perez-Rua, Bernard Ghanem, Yi-Zhe Song, Tao Xiang
Published	2023-03-27 08:39:13+00:00
arXiv page	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google
