A Multi-level Alignment Training Scheme for Video-and-Language Grounding

Abstract

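The key to solving video-and-language grounding tasks is for the network to understand the connection between the two modalities.
For a video paired with a language description, the semantic relation between them is reflected in the similarity of their encodings.
A good multi-modal encoder should capture the semantics of both inputs and encode them in a shared feature space where embedding distance corresponds to semantic similarity.
This work focuses on that semantic connection between video and language and develops a multi-level alignment training scheme that directly shapes the encoding process.
Video-language alignment pairs are designed at the global and segment levels, based on information similarity ranging from high-level context to fine-grained semantics.
A contrastive loss contrasts the encoding similarities of positive and negative alignment pairs, training the network so that similar information is encoded close together in the shared feature space while information with different semantics is kept apart.
The multi-level alignment training can be applied to a variety of video-and-language grounding tasks.
Combined with task-specific training losses, the framework achieves performance comparable to the previous state of the art on multiple video QA and retrieval datasets.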
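
The abstract names a contrastive loss over positive and negative alignment pairs at two levels but gives no formula. The following is a minimal sketch of one common way to realize that idea, not the authors' implementation. It assumes a symmetric InfoNCE-style loss with cosine similarity, in-batch negatives, and a temperature of 0.07; the function names, the `segment_weight` parameter, and the simple weighted sum of the two levels are all hypothetical.

```python
import numpy as np


def l2_normalize(x, eps=1e-8):
    """Scale each row to unit length so dot products become cosine similarities."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)


def log_softmax(logits, axis):
    """Numerically stable log-softmax along the given axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))


def contrastive_alignment_loss(video_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE-style loss (an assumption, not taken from the paper).

    Row i of video_emb and row i of text_emb form a positive pair;
    every other row in the batch serves as a negative.
    """
    v = l2_normalize(video_emb)
    t = l2_normalize(text_emb)
    logits = v @ t.T / temperature          # (batch, batch) similarity matrix
    idx = np.arange(len(v))
    # Video-to-text: each video should pick out its own description.
    loss_v2t = -log_softmax(logits, axis=1)[idx, idx].mean()
    # Text-to-video: each description should pick out its own video.
    loss_t2v = -log_softmax(logits, axis=0)[idx, idx].mean()
    return 0.5 * (loss_v2t + loss_t2v)


def multi_level_alignment_loss(global_v, global_t, segment_v, segment_t,
                               segment_weight=1.0):
    """Hypothetical combination: global-level loss plus weighted segment-level loss."""
    global_loss = contrastive_alignment_loss(global_v, global_t)
    segment_loss = contrastive_alignment_loss(segment_v, segment_t)
    return global_loss + segment_weight * segment_loss


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Stand-in embeddings: 8 whole-video/description pairs, 32 segment/phrase pairs.
    global_v = rng.normal(size=(8, 16))
    global_t = global_v + 0.1 * rng.normal(size=(8, 16))
    segment_v = rng.normal(size=(32, 16))
    segment_t = segment_v + 0.1 * rng.normal(size=(32, 16))
    print(multi_level_alignment_loss(global_v, global_t, segment_v, segment_t))
```

In a training setup like the one the abstract describes, a term of this kind would be added to the task-specific loss, so the encoders are pushed to place matching video and text close together while the task head is trained.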

Abstract (original)

To solve video-and-language grounding tasks, the key is for the network to understand the connection between the two modalities. For a pair of video and language description, their semantic relation is reflected by their encodings’ similarity. A good multi-modality encoder should be able to well capture both inputs’ semantics and encode them in the shared feature space where embedding distance gets properly translated into their semantic similarity. In this work, we focused on this semantic connection between video and language, and developed a multi-level alignment training scheme to directly shape the encoding process. Global and segment levels of video-language alignment pairs were designed, based on the information similarity ranging from high-level context to fine-grained semantics. The contrastive loss was used to contrast the encodings’ similarities between the positive and negative alignment pairs, and to ensure the network is trained in such a way that similar information is encoded closely in the shared feature space while information of different semantics is kept apart. Our multi-level alignment training can be applied to various video-and-language grounding tasks. Together with the task-specific training loss, our framework achieved comparable performance to previous state-of-the-arts on multiple video QA and retrieval datasets.

arXiv information

Authors	Yubo Zhang, Feiyang Niu, Qing Ping, Govind Thattai
Published	2023-02-27 17:22:15+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
