VicTR: Video-conditioned Text Representations for Activity Recognition

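Summary

Vision-language models (VLMs) excel in the image domain, especially in zero-shot settings, thanks to the availability of vast pretraining data (i.e., paired image-text samples).
For video, however, such paired data is less abundant.
Video VLMs are therefore usually designed by adapting pretrained image VLMs to the video domain rather than by training from scratch.
All such recipes rely on augmenting visual embeddings with temporal information (i.e., image $\rightarrow$ video), while the text embeddings are often kept unchanged or even discarded.
In this paper, we argue the contrary: better video VLMs can be designed by focusing on augmenting text rather than visual information.
More specifically, we introduce Video-conditioned Text Representations (VicTR), a form of text embeddings optimized with respect to visual embeddings, which creates a more flexible contrastive latent space.
Our model can further make use of freely available semantic information in the form of visually grounded auxiliary text (e.g., object or scene information).
We evaluate our model on few-shot, zero-shot (HMDB-51, UCF-101), short-form (Kinetics-400), and long-form (Charades) activity recognition benchmarks, showing strong performance among video VLMs.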
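
To make the idea of conditioning text embeddings on video concrete, here is a minimal NumPy sketch. It is not the authors' architecture: the single cross-attention step, the residual weight `alpha`, the mean-pooled video embedding, and the function names `video_conditioned_text` and `classify` are all assumptions chosen for illustration, and random vectors stand in for real image-VLM embeddings.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-8):
    # Scale vectors to (approximately) unit length along the given axis.
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def video_conditioned_text(text_emb, frame_emb, alpha=0.5):
    """Hypothetical conditioning step (an assumption, not the paper's design).

    text_emb:  (C, D) one embedding per class prompt
    frame_emb: (T, D) one embedding per video frame
    Returns (C, D) text embeddings updated with video context.
    """
    d = text_emb.shape[-1]
    # Each class prompt attends over the frames of this particular video.
    attn = softmax(text_emb @ frame_emb.T / np.sqrt(d), axis=-1)  # (C, T)
    context = attn @ frame_emb                                    # (C, D)
    # Residual update: alpha=0 recovers the unchanged text embeddings.
    return l2_normalize(text_emb + alpha * context)

def classify(text_emb, frame_emb, alpha=0.5):
    # Mean-pool frames into one video embedding (a simplifying assumption).
    video_emb = l2_normalize(frame_emb.mean(axis=0))              # (D,)
    cond_text = video_conditioned_text(text_emb, frame_emb, alpha)
    # Cosine similarity between the video and each conditioned class text.
    return cond_text @ video_emb                                  # (C,)

# Toy data: 3 class prompts, 8 frames, 16-dimensional embeddings.
rng = np.random.default_rng(0)
text_emb = l2_normalize(rng.normal(size=(3, 16)))
frame_emb = l2_normalize(rng.normal(size=(8, 16)))
logits = classify(text_emb, frame_emb)
predicted_class = int(np.argmax(logits))
```

In this sketch the text side changes per video while the visual side is only pooled, which mirrors the abstract's emphasis on augmenting text rather than visual information. Setting `alpha=0.0` gives the usual fixed-text baseline for comparison.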

Abstract (original)

Vision-Language models (VLMs) have excelled in the image-domain — especially in zero-shot settings — thanks to the availability of vast pretraining data (i.e., paired image-text samples). However for videos, such paired data is not as abundant. Therefore, video-VLMs are usually designed by adapting pretrained image-VLMs to the video-domain, instead of training from scratch. All such recipes rely on augmenting visual embeddings with temporal information (i.e., image $\rightarrow$ video), often keeping text embeddings unchanged or even being discarded. In this paper, we argue the contrary, that better video-VLMs can be designed by focusing more on augmenting text, rather than visual information. More specifically, we introduce Video-conditioned Text Representations (VicTR): a form of text embeddings optimized w.r.t. visual embeddings, creating a more-flexible contrastive latent space. Our model can further make use of freely-available semantic information, in the form of visually-grounded auxiliary text (e.g. object or scene information). We evaluate our model on few-shot, zero-shot (HMDB-51, UCF-101), short-form (Kinetics-400) and long-form (Charades) activity recognition benchmarks, showing strong performance among video-VLMs.

arXiv information

Authors	Kumara Kahatapitiya,Anurag Arnab,Arsha Nagrani,Michael S. Ryoo
Published	2024-03-29 16:56:33+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google