TV-TREES: Multimodal Entailment Trees for Neuro-Symbolic Video Reasoning

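Abstract

Performing question answering over complex, multimodal content such as television clips is challenging.
This is in part because current video-language models rely on single-modality reasoning, perform worse on long inputs, and lack interpretability.
We propose TV-TREES, the first multimodal entailment tree generator.
TV-TREES serves as an approach to video understanding that promotes interpretable joint-modality reasoning by producing trees of entailment relationships between simple premises directly entailed by the videos and higher-level conclusions.
We then introduce the task of multimodal entailment tree generation to evaluate the reasoning quality of such methods.
Experimental results for our method on the challenging TVQA dataset demonstrate interpretable, state-of-the-art zero-shot performance on full video clips, offering the best of both worlds in contrast to black-box methods.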

Abstract (original)

It is challenging to perform question-answering over complex, multimodal content such as television clips. This is in part because current video-language models rely on single-modality reasoning, have lowered performance on long inputs, and lack interpretability. We propose TV-TREES, the first multimodal entailment tree generator. TV-TREES serves as an approach to video understanding that promotes interpretable joint-modality reasoning by producing trees of entailment relationships between simple premises directly entailed by the videos and higher-level conclusions. We then introduce the task of multimodal entailment tree generation to evaluate the reasoning quality of such methods. Our method’s experimental results on the challenging TVQA dataset demonstrate interpretable, state-of-the-art zero-shot performance on full video clips, illustrating a best-of-both-worlds contrast to black-box methods.
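
The abstract describes trees whose leaves are simple premises directly entailed by the video and whose internal nodes are higher-level conclusions. As a rough illustration only, such a tree might be represented as below. This is a minimal sketch, not the authors' implementation: the class `EntailmentNode`, its fields, the modality labels `"dialogue"` and `"vision"`, and the example statements are all hypothetical and chosen purely for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set


@dataclass
class EntailmentNode:
    """One statement in a hypothetical entailment tree (illustrative only)."""

    statement: str
    # For leaf premises: the modality assumed to support the statement
    # directly (hypothetical labels: "dialogue" or "vision").
    # Intermediate conclusions leave this as None.
    modality: Optional[str] = None
    # Child statements that are assumed to jointly entail this one.
    children: List["EntailmentNode"] = field(default_factory=list)

    def is_leaf(self) -> bool:
        return not self.children

    def leaves(self) -> List["EntailmentNode"]:
        """Collect all leaf premises beneath (or at) this node."""
        if self.is_leaf():
            return [self]
        collected: List["EntailmentNode"] = []
        for child in self.children:
            collected.extend(child.leaves())
        return collected

    def is_grounded(self) -> bool:
        """True when every leaf premise is tied to some modality."""
        return all(leaf.modality is not None for leaf in self.leaves())

    def modalities(self) -> Set[str]:
        """Set of modalities the leaf premises draw on."""
        return {leaf.modality for leaf in self.leaves() if leaf.modality}


# Invented example: a conclusion supported by one dialogue premise
# and one visual premise.
root = EntailmentNode(
    statement="The character is about to leave on a trip.",
    children=[
        EntailmentNode("The character says goodbye.", modality="dialogue"),
        EntailmentNode("The character is holding a suitcase.", modality="vision"),
    ],
)

print(root.is_grounded())
print(sorted(root.modalities()))
```

In this toy representation, a tree counts as joint-modality when `modalities()` returns more than one label, and it can be inspected by walking from the root conclusion down to the leaf premises.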

arXiv information

Authors	Kate Sanders,Nathaniel Weir,Benjamin Van Durme
Published	2024-02-29 18:57:01+00:00
arXiv site	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google
