Learning to Locate Visual Answer in Video Corpus Using Question

Abstract

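We introduce a new task, video corpus visual answer localization (VCVAL), which aims to locate the visual answer to a natural language question in a large collection of untrimmed, unsegmented instructional videos.
The task requires a range of skills: vision-language interaction, video retrieval, passage comprehension, and visual answer localization.
In this paper, we propose a cross-modal contrastive global-span (CCGS) method for VCVAL that jointly trains the video corpus retrieval and visual answer localization subtasks.
More precisely, we first enhance the video question-answer semantics by adding element-wise visual information to a pre-trained language model, and then design a novel global-span predictor over the fused information to locate the visual answer point.
Global-span contrastive learning is adopted to sort the span points of positive and negative samples using the global-span matrix.
We reconstructed a dataset named MedVidCQA, on which the VCVAL task is benchmarked.
Experimental results show that the proposed method outperforms other competitive methods on both the video corpus retrieval and visual answer localization subtasks.
Most importantly, we perform detailed analyses of extensive experiments, opening a new path for understanding instructional videos and guiding further research.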
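
The abstract does not say how the global-span predictor or its matrix is built, so the following is only an illustrative sketch of the general idea of choosing one span across a whole corpus, not the authors' CCGS implementation. It assumes, purely for illustration, that a model has already produced a start score and an end score for every position in every candidate video; the function name `best_global_span` and the score layout are hypothetical.

```python
from typing import List, Optional, Tuple

# Hypothetical layout (an assumption, not from the paper):
# scores[v][t] is the score of position t in candidate video v.
Scores = List[List[float]]


def best_global_span(
    start_scores: Scores, end_scores: Scores
) -> Optional[Tuple[int, int, int, float]]:
    """Return (video_index, start, end, score) of the highest-scoring span.

    A span is scored as start score + end score, must satisfy
    start <= end, and must lie inside a single video. Searching all
    videos at once makes the arg-max perform retrieval (which video)
    and localization (which span) in one step.
    """
    best: Optional[Tuple[int, int, int, float]] = None
    for v, (s_row, e_row) in enumerate(zip(start_scores, end_scores)):
        for i, s in enumerate(s_row):
            # Only end positions at or after the start are valid.
            for j in range(i, len(e_row)):
                score = s + e_row[j]
                if best is None or score > best[3]:
                    best = (v, i, j, score)
    return best


# Toy example with two candidate videos of three positions each.
toy_start = [[0.1, 0.2, 0.0], [0.0, 2.0, 0.1]]
toy_end = [[0.3, 0.1, 0.2], [0.1, 0.0, 1.5]]
result = best_global_span(toy_start, toy_end)
```

In this toy input the strongest start and end scores both fall in the second video, so the selected span identifies that video and its positions together. A contrastive training objective, as named in the abstract, would presumably push spans from the correct video above spans from the others, but that part is not sketched here.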

Abstract (original)

We introduce a new task, named video corpus visual answer localization (VCVAL), which aims to locate the visual answer in a large collection of untrimmed, unsegmented instructional videos using a natural language question. This task requires a range of skills – the interaction between vision and language, video retrieval, passage comprehension, and visual answer localization. In this paper, we propose a cross-modal contrastive global-span (CCGS) method for the VCVAL, jointly training the video corpus retrieval and visual answer localization subtasks. More precisely, we first enhance the video question-answer semantic by adding element-wise visual information into the pre-trained language model, and then design a novel global-span predictor through fusion information to locate the visual answer point. The global-span contrastive learning is adopted to sort the span point from the positive and negative samples with the global-span matrix. We have reconstructed a dataset named MedVidCQA, on which the VCVAL task is benchmarked. Experimental results show that the proposed method outperforms other competitive methods both in the video corpus retrieval and visual answer localization subtasks. Most importantly, we perform detailed analyses on extensive experiments, paving a new path for understanding the instructional videos, which ushers in further research.

arXiv information

Authors	Bin Li, Yixuan Weng, Bin Sun, Shutao Li
Published	2022-10-13 15:48:51+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
