Multimodal Open-Vocabulary Video Classification via Pre-Trained Vision and Language Models

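Abstract

Utilizing vision and language models (VLMs) pre-trained on large-scale image-text pairs is becoming a promising paradigm for open-vocabulary visual recognition.
In this work, we extend this paradigm by leveraging the motion and audio that naturally exist in video.
We present \textbf{MOV}, a simple yet effective method for \textbf{M}ultimodal \textbf{O}pen-\textbf{V}ocabulary video classification.
In MOV, we directly use the vision encoder from pre-trained VLMs, with minimal modifications, to encode video, optical flow, and audio spectrograms.
We design a cross-modal fusion mechanism to aggregate complementary multimodal information.
Experiments on Kinetics-700 and VGGSound show that introducing the flow or audio modality brings large performance gains over the pre-trained VLM and existing methods.
Specifically, MOV greatly improves accuracy on base classes while generalizing better to novel classes.
MOV achieves state-of-the-art results on the UCF and HMDB zero-shot video classification benchmarks, significantly outperforming both traditional zero-shot methods and recent VLM-based methods.
Code and models will be released.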

Abstract (original)

Utilizing vision and language models (VLMs) pre-trained on large-scale image-text pairs is becoming a promising paradigm for open-vocabulary visual recognition. In this work, we extend this paradigm by leveraging motion and audio that naturally exist in video. We present \textbf{MOV}, a simple yet effective method for \textbf{M}ultimodal \textbf{O}pen-\textbf{V}ocabulary video classification. In MOV, we directly use the vision encoder from pre-trained VLMs with minimal modifications to encode video, optical flow and audio spectrogram. We design a cross-modal fusion mechanism to aggregate complementary multimodal information. Experiments on Kinetics-700 and VGGSound show that introducing flow or audio modality brings large performance gains over the pre-trained VLM and existing methods. Specifically, MOV greatly improves the accuracy on base classes, while generalizing better on novel classes. MOV achieves state-of-the-art results on UCF and HMDB zero-shot video classification benchmarks, significantly outperforming both traditional zero-shot methods and recent methods based on VLMs. Code and models will be released.
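
As a rough illustration of the open-vocabulary setup the abstract describes, the sketch below scores a clip against class names by cosine similarity between a fused multimodal embedding and per-class text embeddings. Everything in it is an assumption made for illustration: the function names (`fuse_average`, `classify_open_vocab`), the toy 3-dimensional embeddings, and in particular the fusion step, which is a plain average of normalized embeddings and is not the cross-modal fusion mechanism proposed in the paper (the abstract does not specify it). In practice the embeddings would come from the VLM's vision and text encoders.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def cosine(a, b):
    """Cosine similarity between two vectors of equal length."""
    a, b = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a, b))


def fuse_average(embeddings):
    """Placeholder fusion: mean of L2-normalized modality embeddings.

    Assumption for illustration only; NOT the paper's cross-modal fusion.
    """
    normed = [l2_normalize(e) for e in embeddings]
    dim = len(normed[0])
    return [sum(e[i] for e in normed) / len(normed) for i in range(dim)]


def classify_open_vocab(modality_embeddings, class_text_embeddings):
    """Return the best-matching class name and the score for every class.

    modality_embeddings: dict such as {"video": [...], "flow": [...]}
    class_text_embeddings: dict mapping class name -> text embedding
    """
    fused = fuse_average(list(modality_embeddings.values()))
    scores = {
        name: cosine(fused, emb) for name, emb in class_text_embeddings.items()
    }
    best = max(scores, key=scores.get)
    return best, scores


# Toy embeddings (hypothetical values, not produced by any real encoder).
clip = {
    "video": [0.9, 0.1, 0.0],
    "flow": [0.8, 0.2, 0.1],
    "audio": [0.7, 0.0, 0.3],
}
classes = {
    "playing guitar": [1.0, 0.0, 0.0],
    "swimming": [0.0, 1.0, 0.0],
}
label, scores = classify_open_vocab(clip, classes)
print(label)
```

Because the class set is just a dictionary of text embeddings, new class names can be added at inference time without retraining, which is the sense in which such a classifier is open-vocabulary.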

arXiv information

Authors	Rui Qian, Yeqing Li, Zheng Xu, Ming-Hsuan Yang, Serge Belongie, Yin Cui
Published	2022-07-15 17:59:11+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
