Everything at Once — Multi-modal Fusion Transformer for Video Retrieval

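Abstract

Multi-modal learning from video data has recently attracted attention because it allows semantically meaningful embeddings to be trained without human annotation, which enables tasks such as zero-shot retrieval and classification.
In this work, we present a multi-modal, modality-agnostic fusion transformer approach that learns to exchange information between multiple modalities, such as video, audio, and text, and to integrate them into a joint multi-modal representation, yielding an embedding that aggregates multi-modal temporal information.
We propose training the system with a combinatorial loss on everything at once, single modalities as well as pairs of modalities, while explicitly leaving out add-ons such as position or modality encodings.
At test time, the resulting model can process and fuse any number of input modalities.
Moreover, the implicit properties of the transformer allow it to process inputs of different lengths.
To evaluate the proposed approach, we train the model on the large-scale HowTo100M dataset and evaluate the resulting embedding space on four challenging benchmark datasets, obtaining state-of-the-art results in zero-shot video retrieval and zero-shot video action localization.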

Abstract (original)

Multi-modal learning from video data has seen increased attention recently as it allows to train semantically meaningful embeddings without human annotation enabling tasks like zero-shot retrieval and classification. In this work, we present a multi-modal, modality agnostic fusion transformer approach that learns to exchange information between multiple modalities, such as video, audio, and text, and integrate them into a joined multi-modal representation to obtain an embedding that aggregates multi-modal temporal information. We propose to train the system with a combinatorial loss on everything at once, single modalities as well as pairs of modalities, explicitly leaving out any add-ons such as position or modality encoding. At test time, the resulting model can process and fuse any number of input modalities. Moreover, the implicit properties of the transformer allow to process inputs of different lengths. To evaluate the proposed approach, we train the model on the large scale HowTo100M dataset and evaluate the resulting embedding space on four challenging benchmark datasets obtaining state-of-the-art results in zero-shot video retrieval and zero-shot video action localization.
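
The combinatorial loss mentioned in the abstract can be illustrated with a small sketch. The abstract does not state which combinations are contrasted or which contrastive objective is used, so everything below is an assumption: the code contrasts each single modality with every other single modality and with the fused pair of the remaining two, and uses a symmetric InfoNCE-style loss. The names `training_terms`, `contrastive_loss`, `combinatorial_loss`, and `mean_fusion` are hypothetical, and `mean_fusion` is only a toy stand-in for the fusion transformer.

```python
import math
from itertools import combinations

MODALITIES = ("text", "video", "audio")


def training_terms(modalities=MODALITIES):
    """Enumerate (left, right) modality groups to contrast (assumed scheme)."""
    singles = [(m,) for m in modalities]
    pairs = list(combinations(modalities, 2))
    # Single vs. single, e.g. (("text",), ("video",)).
    terms = list(combinations(singles, 2))
    # Single vs. the fused pair that does not contain it.
    for single in singles:
        for pair in pairs:
            if single[0] not in pair:
                terms.append((single, pair))
    return terms


def _dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def _normalize(v):
    norm = math.sqrt(_dot(v, v)) or 1.0  # avoid dividing by zero
    return [x / norm for x in v]


def _logsumexp(xs):
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))


def contrastive_loss(left, right, temperature=0.1):
    """Symmetric InfoNCE-style loss; row i of left matches row i of right."""
    left = [_normalize(v) for v in left]
    right = [_normalize(v) for v in right]
    n = len(left)
    sims = [[_dot(l, r) / temperature for r in right] for l in left]
    loss = 0.0
    for i in range(n):
        column = [sims[j][i] for j in range(n)]
        loss += _logsumexp(sims[i]) - sims[i][i]  # left -> right
        loss += _logsumexp(column) - sims[i][i]   # right -> left
    return loss / (2 * n)


def mean_fusion(inputs):
    """Toy stand-in for the fusion model: average vectors across modalities."""
    batches = list(inputs.values())
    return [
        [sum(values) / len(values) for values in zip(*sample)]
        for sample in zip(*batches)
    ]


def combinatorial_loss(embed, batch, modalities=MODALITIES):
    """Sum the contrastive loss over every enumerated combination."""
    total = 0.0
    for left, right in training_terms(modalities):
        left_emb = embed({m: batch[m] for m in left})
        right_emb = embed({m: batch[m] for m in right})
        total += contrastive_loss(left_emb, right_emb)
    return total


aligned = {m: [[1.0, 0.0], [0.0, 1.0]] for m in MODALITIES}
shuffled = dict(aligned, video=[[0.0, 1.0], [1.0, 0.0]])
aligned_loss = combinatorial_loss(mean_fusion, aligned)
shuffled_loss = combinatorial_loss(mean_fusion, shuffled)
```

With three modalities this scheme yields six terms, and a batch whose modalities agree per sample scores a lower loss than one where the video rows are swapped. Because the same `embed` callable receives one or two modalities, the sketch also mirrors the abstract's point that the model can fuse any number of inputs at test time.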

arXiv information

Authors	Nina Shvetsova, Brian Chen, Andrew Rouditchenko, Samuel Thomas, Brian Kingsbury, Rogerio Feris, David Harwath, James Glass, Hilde Kuehne
Published	2022-08-18 10:21:14+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
