Invariant Grounding for Video Question Answering

Abstract

Video Question Answering (VideoQA) is the task of answering questions about a video. At its core is understanding the alignments between visual scenes in video and linguistic semantics in question to yield the answer. In leading VideoQA models, the typical learning objective, empirical risk minimization (ERM), latches on superficial correlations between video-question pairs and answers as the alignments. However, ERM can be problematic, because it tends to over-exploit the spurious correlations between question-irrelevant scenes and answers, instead of inspecting the causal effect of question-critical scenes. As a result, the VideoQA models suffer from unreliable reasoning. In this work, we first take a causal look at VideoQA and argue that invariant grounding is the key to ruling out the spurious correlations. Towards this end, we propose a new learning framework, Invariant Grounding for VideoQA (IGV), to ground the question-critical scene, whose causal relations with answers are invariant across different interventions on the complement. With IGV, the VideoQA models are forced to shield the answering process from the negative influence of spurious correlations, which significantly improves the reasoning ability. Experiments on three benchmark datasets validate the superiority of IGV in terms of accuracy, visual explainability, and generalization ability over the leading baselines.
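The idea of invariance under interventions on the complement can be illustrated with a toy sketch. This is not the authors' implementation: the function names, the clip labels, and the `invariance_gap` measure are all hypothetical, and a real model would operate on learned clip features rather than strings. The sketch only shows what it means for an answer to stay fixed when the question-irrelevant clips are swapped out.

```python
import random

def intervene_on_complement(clips, causal_mask, donor_clips, rng):
    """Keep clips marked question-critical; replace every other clip with a donor clip."""
    return [clip if keep else rng.choice(donor_clips)
            for clip, keep in zip(clips, causal_mask)]

def invariance_gap(predict, clips, causal_mask, donor_clips, n_interventions=5, seed=0):
    """Fraction of interventions on the complement that change the predicted answer."""
    rng = random.Random(seed)
    base_answer = predict(clips)
    changed = 0
    for _ in range(n_interventions):
        intervened = intervene_on_complement(clips, causal_mask, donor_clips, rng)
        if predict(intervened) != base_answer:
            changed += 1
    return changed / n_interventions

# Hypothetical video: only the middle clip matters for "What does the dog do?"
clips = ["kitchen", "dog_jumps", "kitchen"]
causal_mask = [False, True, False]
donor_clips = ["beach", "street"]  # stand-ins for clips taken from other videos

def grounded_predict(video):
    # Toy predictor that looks only at the question-critical scene.
    return "jump" if "dog_jumps" in video else "unknown"

def spurious_predict(video):
    # Toy predictor that relies on the background, a spurious cue.
    return "cook" if "kitchen" in video else "unknown"

grounded_gap = invariance_gap(grounded_predict, clips, causal_mask, donor_clips)
spurious_gap = invariance_gap(spurious_predict, clips, causal_mask, donor_clips)
```

In this toy setting the grounded predictor gives the same answer under every intervention, while the background-dependent predictor changes its answer each time the complement is replaced.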

arXiv information

Authors	Yicong Li, Xiang Wang, Junbin Xiao, Wei Ji, Tat-Seng Chua
Published	2022-06-06 04:37:52+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, DeepL
