Tell me what you see: A zero-shot action recognition method based on natural language descriptions

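Summary

This paper introduces a new approach to zero-shot action recognition.
Recent work has explored detecting and classifying objects to obtain semantic information from videos, with remarkable performance.
Inspired by that work, we propose using video captioning methods to extract semantic information about objects, scenes, humans, and the relationships among them.
To the best of our knowledge, this is the first work to represent both videos and labels with descriptive sentences.
Specifically, videos are represented by sentences generated with video captioning methods, and classes are represented by sentences extracted from documents retrieved through Internet search engines.
From these representations, we build a shared semantic space using BERT-based embedders pre-trained on the paraphrasing task over multiple text datasets.
Because both the visual and the semantic information are sentences, projecting them onto this space is straightforward, which enables classification with the nearest neighbor rule.
We show that representing videos and labels with sentences alleviates the domain adaptation problem.
We also show that word vectors are unsuitable for building the semantic embedding space of our descriptions.
Our method outperforms the state of the art on the UCF101 dataset by 3.3 p.p. in accuracy under the TruZe protocol.
Under the conventional protocol (0/50% training/testing split), it achieves competitive results on both the UCF101 and HMDB51 datasets.
Our code is available at https://github.com/valterlej/zsarcap.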

Summary (original)

This paper presents a novel approach to Zero-Shot Action Recognition. Recent works have explored the detection and classification of objects to obtain semantic information from videos with remarkable performance. Inspired by them, we propose using video captioning methods to extract semantic information about objects, scenes, humans, and their relationships. To the best of our knowledge, this is the first work to represent both videos and labels with descriptive sentences. More specifically, we represent videos using sentences generated via video captioning methods and classes using sentences extracted from documents acquired through search engines on the Internet. Using these representations, we build a shared semantic space employing BERT-based embedders pre-trained in the paraphrasing task on multiple text datasets. The projection of both visual and semantic information onto this space is straightforward, as they are sentences, enabling classification using the nearest neighbor rule. We demonstrate that representing videos and labels with sentences alleviates the domain adaptation problem. Additionally, we show that word vectors are unsuitable for building the semantic embedding space of our descriptions. Our method outperforms the state-of-the-art performance on the UCF101 dataset by 3.3 p.p. in accuracy under the TruZe protocol and achieves competitive results on both the UCF101 and HMDB51 datasets under the conventional protocol (0/50% – training/testing split). Our code is available at https://github.com/valterlej/zsarcap.
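
The classification step described in the abstract (sentences for videos and classes placed in one space, then the nearest neighbor rule) can be illustrated with a minimal sketch. This is not the authors' implementation. The `embed` function below is a hypothetical stand-in that uses bag-of-words counts only so the example runs without external models; the paper uses BERT-based paraphrase embedders and reports that word-level vectors are unsuitable. Joining all sentences of a video or class into one text before embedding is also an assumption made here for brevity, as are the example captions and class descriptions.

```python
import math
from collections import Counter


def embed(text):
    """Hypothetical stand-in embedder: bag-of-words counts.

    The paper uses a BERT-based paraphrase embedder instead; this toy
    version only keeps the sketch self-contained and runnable.
    """
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def classify(video_captions, class_descriptions):
    """Nearest neighbor rule: return the class whose description is closest
    to the video's captions in the shared space, with its similarity."""
    # Assumption: one vector per video, built from all captions joined.
    video_vec = embed(" ".join(video_captions))
    best_label, best_score = None, -1.0
    for label, sentences in class_descriptions.items():
        # Assumption: one vector per class, built from all sentences joined.
        score = cosine(video_vec, embed(" ".join(sentences)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label, best_score


# Made-up class descriptions, standing in for sentences taken from web documents.
CLASS_DESCRIPTIONS = {
    "playing guitar": [
        "a person plays a guitar by strumming the strings",
        "the guitar is a musical instrument with strings",
    ],
    "horse riding": [
        "a person rides a horse in a field",
        "horse riding is an equestrian sport",
    ],
}

# Made-up captions, standing in for the output of a video captioning model.
captions = [
    "a man is sitting and playing a guitar",
    "a man strums the strings of a guitar",
]

label, score = classify(captions, CLASS_DESCRIPTIONS)
print(label)  # playing guitar
```

No classifier is trained on the target classes here. Adding an unseen class only requires adding its description sentences, which is what makes the setting zero-shot.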

arXiv information

Authors	Valter Estevam, Rayson Laroca, David Menotti, Helio Pedrini
Published	2023-09-11 17:57:15+00:00

Source, services used

arxiv.jp, Google
