STAIR: Spatial-Temporal Reasoning with Auditable Intermediate Results for Video Question Answering

Abstract

Recently we have witnessed the rapid development of video question answering models. However, most models can only handle simple videos in terms of temporal reasoning, and their performance tends to drop when answering temporal-reasoning questions on long and informative videos. To tackle this problem we propose STAIR, a Spatial-Temporal Reasoning model with Auditable Intermediate Results for video question answering. STAIR is a neural module network, which contains a program generator to decompose a given question into a hierarchical combination of several sub-tasks, and a set of lightweight neural modules to complete each of these sub-tasks. Though neural module networks are already widely studied on image-text tasks, applying them to videos is a non-trivial task, as reasoning on videos requires different abilities. In this paper, we define a set of basic video-text sub-tasks for video question answering and design a set of lightweight modules to complete them. Different from most prior works, modules of STAIR return intermediate outputs specific to their intentions instead of always returning attention maps, which makes it easier to interpret and collaborate with pre-trained models. We also introduce intermediate supervision to make these intermediate outputs more accurate. We conduct extensive experiments on several video question answering datasets under various settings to show STAIR’s performance, explainability, compatibility with pre-trained models, and applicability when program annotations are not available. Code: https://github.com/yellow-binary-tree/STAIR
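
As a rough illustration of the idea described in the abstract (a program that decomposes a question into nested sub-tasks, with each module returning an output specific to its purpose that can be inspected afterwards), here is a minimal toy sketch. It is not the STAIR implementation: the module names `find`, `after`, and `query`, the set-of-labels "video", and the hand-written program are all hypothetical stand-ins. In the paper the modules are neural networks and the program comes from a learned program generator.

```python
from dataclasses import dataclass, field
from typing import Any, List, Set, Tuple

# Toy "video": one set of visible object labels per frame (hypothetical data).
Video = List[Set[str]]


@dataclass
class Trace:
    """Collects every module's output so each step can be audited."""
    steps: List[Tuple[str, Any]] = field(default_factory=list)


def find(video: Video, label: str) -> List[int]:
    """Return the indices of frames in which `label` is visible."""
    return [i for i, frame in enumerate(video) if label in frame]


def after(video: Video, frames: List[int]) -> List[int]:
    """Return the indices of all frames later than the last given frame."""
    if not frames:
        return []
    return list(range(max(frames) + 1, len(video)))


def query(video: Video, frames: List[int]) -> List[str]:
    """Return the sorted labels visible in any of the given frames."""
    labels: Set[str] = set()
    for i in frames:
        labels |= video[i]
    return sorted(labels)


# Hypothetical module registry; real modules would be small neural networks.
MODULES = {"find": find, "after": after, "query": query}


def execute(program: tuple, video: Video, trace: Trace) -> Any:
    """Run a nested program bottom-up, recording each intermediate result."""
    name, *args = program
    # A tuple argument is a sub-program; anything else is a literal.
    inputs = [execute(a, video, trace) if isinstance(a, tuple) else a
              for a in args]
    output = MODULES[name](video, *inputs)
    trace.steps.append((name, output))
    return output


# "What is visible after the door appears?" written by hand as a program.
video = [{"person"}, {"person", "door"}, {"person", "cup"}, {"cup"}]
program = ("query", ("after", ("find", "door")))

trace = Trace()
answer = execute(program, video, trace)
for name, output in trace.steps:
    print(name, output)
```

Because each step returns a typed result (frame indices, then labels) rather than an opaque attention map, the trace can be read step by step to see where a wrong answer originated, which is the kind of auditability the abstract refers to.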

arXiv information

Authors	Yueqian Wang, Yuxuan Wang, Kai Chen, Dongyan Zhao
Published	2024-01-08 14:01:59+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
