Revisiting the ‘Video’ in Video-Language Understanding

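Abstract

What makes a video task uniquely suited to video, beyond what can be understood from a single image?
Building on recent progress in self-supervised image-language models, we revisit this question in the context of video-and-language tasks.
We propose the atemporal probe (ATP), a new model for video-language analysis that provides a stronger bound on the baseline accuracy of multimodal models constrained by image-level understanding.
Applying this model to standard discriminative video-and-language tasks, such as video question answering and text-to-video retrieval, we characterize the limitations and potential of current video-language benchmarks.
We find that understanding event temporality is often unnecessary for strong or even state-of-the-art performance, even in comparison with recent large-scale video-language models and in settings intended to benchmark deeper video-level understanding.
We also show how ATP can improve both video-language dataset design and model design.
We describe a technique that uses ATP to better disentangle dataset subsets with a higher concentration of temporally challenging data, which improves the efficacy of benchmarks for causal and temporal understanding.
Further, we show that effectively integrating ATP into full video-level temporal models can improve both efficiency and state-of-the-art accuracy.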

Abstract (original)

What makes a video task uniquely suited for videos, beyond what can be understood from a single image? Building on recent progress in self-supervised image-language models, we revisit this question in the context of video and language tasks. We propose the atemporal probe (ATP), a new model for video-language analysis which provides a stronger bound on the baseline accuracy of multimodal models constrained by image-level understanding. By applying this model to standard discriminative video and language tasks, such as video question answering and text-to-video retrieval, we characterize the limitations and potential of current video-language benchmarks. We find that understanding of event temporality is often not necessary to achieve strong or state-of-the-art performance, even compared with recent large-scale video-language models and in contexts intended to benchmark deeper video-level understanding. We also demonstrate how ATP can improve both video-language dataset and model design. We describe a technique for leveraging ATP to better disentangle dataset subsets with a higher concentration of temporally challenging data, improving benchmarking efficacy for causal and temporal understanding. Further, we show that effectively integrating ATP into full video-level temporal models can improve efficiency and state-of-the-art accuracy.

arXiv information

Authors	Shyamal Buch, Cristóbal Eyzaguirre, Adrien Gaidon, Jiajun Wu, Li Fei-Fei, Juan Carlos Niebles
Published	2022-06-03 17:57:33+00:00
arXiv page	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
