Audiovisual Masked Autoencoders

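Summary

Can the audiovisual information already present in video be used to improve self-supervised representation learning?
To answer this question, we study various pretraining architectures and objectives within the masked autoencoding framework, motivated by the success of similar methods in natural language and image understanding.
We show that this yields significant improvements on audiovisual downstream classification tasks, surpassing the state of the art on VGGSound and AudioSet.
Furthermore, a single audiovisual pretrained model lets us apply our audiovisual pretraining scheme to multiple unimodal downstream tasks.
We also demonstrate the transferability of our representations by achieving state-of-the-art audiovisual results on Epic Kitchens without pretraining specifically for this dataset.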

Abstract (original)

Can we leverage the audiovisual information already present in video to improve self-supervised representation learning? To answer this question, we study various pretraining architectures and objectives within the masked autoencoding framework, motivated by the success of similar methods in natural language and image understanding. We show that we can achieve significant improvements on audiovisual downstream classification tasks, surpassing the state-of-the-art on VGGSound and AudioSet. Furthermore, we can leverage our audiovisual pretraining scheme for multiple unimodal downstream tasks using a single audiovisual pretrained model. We additionally demonstrate the transferability of our representations, achieving state-of-the-art audiovisual results on Epic Kitchens without pretraining specifically for this dataset.
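The general masked autoencoding idea mentioned in the abstract can be sketched as follows: hide a random subset of input tokens, encode only the visible ones, and score the reconstruction on the hidden ones. This is a minimal illustration, not the paper's architecture. The function names (`split_tokens`, `masked_mse`), the token counts, the 0.75 masking ratio, and the mean-squared-error objective are all assumptions made for the example; the abstract does not specify them.

```python
import random


def split_tokens(num_tokens, mask_ratio, rng):
    """Randomly partition token indices into (visible, masked) lists."""
    num_masked = int(round(num_tokens * mask_ratio))
    order = list(range(num_tokens))
    rng.shuffle(order)
    masked = sorted(order[:num_masked])
    visible = sorted(order[num_masked:])
    return visible, masked


def masked_mse(targets, predictions, masked):
    """Mean squared error computed over the masked token positions only."""
    total = 0.0
    count = 0
    for i in masked:
        for t, p in zip(targets[i], predictions[i]):
            total += (t - p) ** 2
            count += 1
    return total / count if count else 0.0


rng = random.Random(0)

# Hypothetical token counts: 16 video patch tokens, 8 audio spectrogram tokens.
# Each modality is masked independently here (an assumption for illustration).
video_visible, video_masked = split_tokens(16, 0.75, rng)
audio_visible, audio_masked = split_tokens(8, 0.75, rng)

# Toy targets and predictions: two tokens with two features each.
toy_targets = [[1.0, 2.0], [3.0, 4.0]]
toy_predictions = [[1.0, 2.0], [0.0, 0.0]]
toy_loss = masked_mse(toy_targets, toy_predictions, masked=[1])
```

In a real model, an encoder would process only the visible video and audio tokens, and a decoder would predict the masked ones. Whether the two modalities are encoded jointly or separately is the kind of architectural choice the abstract says the paper studies.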

arXiv information

Authors	Mariana-Iuliana Georgescu, Eduardo Fonseca, Radu Tudor Ionescu, Mario Lucic, Cordelia Schmid, Anurag Arnab
Published	2023-07-28 12:22:59+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
