Zorro: the masked multimodal transformer

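Summary

Attention-based models are well suited to multimodal processing: inputs from several modalities can be concatenated and fed to a single backbone network, so very little fusion engineering is needed.
The resulting representations, however, are fully entangled throughout the network, which is not always desirable. During training, contrastive audio-visual self-supervised learning needs independent audio and visual features to work; without them, learning collapses.
At inference time, it should be possible to evaluate an audio-visual model on audio-only or video-only benchmarks.
This paper introduces Zorro, a technique that uses masks to control how inputs from each modality are routed inside a Transformer, keeping parts of the representation modality-pure.
The technique is applied to three popular transformer-based architectures (ViT, Swin, and HiP), and with contrastive pre-training Zorro achieves state-of-the-art results on most relevant benchmarks for multimodal tasks (AudioSet and VGGSound).
The resulting models can also perform unimodal inference on both video and audio benchmarks such as Kinetics-400 and ESC-50.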
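
Mask-based routing of this kind can be illustrated with an attention mask. The following is a minimal sketch, not the authors' implementation: it assumes the tokens are split into three groups (audio, video, and a fusion group), that audio and video tokens may attend only within their own group, and that fusion tokens may attend to everything. The function names `zorro_style_mask` and `masked_attention`, the group layout, and the single-head NumPy attention are all illustrative assumptions.

```python
import numpy as np


def zorro_style_mask(n_audio, n_video, n_fusion):
    """Boolean [N, N] mask; mask[i, j] is True if query i may attend to key j.

    Assumed token order (hypothetical): audio, then video, then fusion.
    """
    n = n_audio + n_video + n_fusion
    audio = slice(0, n_audio)
    video = slice(n_audio, n_audio + n_video)
    fusion = slice(n_audio + n_video, n)

    mask = np.zeros((n, n), dtype=bool)
    mask[audio, audio] = True   # audio queries see only audio keys
    mask[video, video] = True   # video queries see only video keys
    mask[fusion, :] = True      # fusion queries see every key
    return mask


def masked_attention(q, k, v, mask):
    """Single-head scaled dot-product attention with a boolean mask."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    # Disallowed pairs get -inf so they receive zero weight after softmax.
    scores = np.where(mask, scores, -np.inf)
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v
```

Under these assumptions, the outputs at the audio positions depend only on audio inputs and the outputs at the video positions only on video inputs, so those parts of the representation stay modality-pure, while the fusion positions combine both modalities.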

Abstract (original)

Attention-based models are appealing for multimodal processing because inputs from multiple modalities can be concatenated and fed to a single backbone network – thus requiring very little fusion engineering. The resulting representations are however fully entangled throughout the network, which may not always be desirable: in learning, contrastive audio-visual self-supervised learning requires independent audio and visual features to operate, otherwise learning collapses; in inference, evaluation of audio-visual models should be possible on benchmarks having just audio or just video. In this paper, we introduce Zorro, a technique that uses masks to control how inputs from each modality are routed inside Transformers, keeping some parts of the representation modality-pure. We apply this technique to three popular transformer-based architectures (ViT, Swin and HiP) and show that with contrastive pre-training Zorro achieves state-of-the-art results on most relevant benchmarks for multimodal tasks (AudioSet and VGGSound). Furthermore, the resulting models are able to perform unimodal inference on both video and audio benchmarks such as Kinetics-400 or ESC-50.

arXiv information

Authors	Adrià Recasens, Jason Lin, João Carreira, Drew Jaegle, Luyu Wang, Jean-Baptiste Alayrac, Pauline Luc, Antoine Miech, Lucas Smaira, Ross Hemsley, Andrew Zisserman
Published	2023-01-23 17:51:39+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
