Zero-Shot Video Question Answering via Frozen Bidirectional Language Models

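Summary

Video question answering (VideoQA) is a complex task that requires diverse multimodal data for training.
Manually annotating questions and answers for videos, however, is tedious and limits scalability.
To address this, recent methods consider the zero-shot setting, which uses no manually annotated visual question-answer pairs.
In particular, one promising approach adapts frozen autoregressive language models, pretrained on Web-scale text-only data, to multimodal inputs.
In contrast, we build on frozen bidirectional language models (BiLM) and show that this approach offers a stronger and cheaper alternative for zero-shot VideoQA.
Specifically, we (i) combine visual inputs with the frozen BiLM using lightweight trainable modules, (ii) train those modules on Web-scraped multimodal data, and (iii) perform zero-shot VideoQA inference through masked language modeling, where the masked text is the answer to the given question.
Our proposed approach, FrozenBiLM, outperforms the state of the art in zero-shot VideoQA by a significant margin on a variety of datasets, including LSMDC-FiB, iVQA, MSRVTT-QA, MSVD-QA, ActivityNet-QA, TGIF-FrameQA, How2QA and TVQA.
It also shows competitive performance in the few-shot and fully-supervised settings.
Our code and models will be made publicly available at https://antoyang.github.io/frozenbilm.html.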

Abstract (original)

Video question answering (VideoQA) is a complex task that requires diverse multi-modal data for training. Manual annotation of question and answers for videos, however, is tedious and prohibits scalability. To tackle this problem, recent methods consider zero-shot settings with no manual annotation of visual question-answer. In particular, a promising approach adapts frozen autoregressive language models pretrained on Web-scale text-only data to multi-modal inputs. In contrast, we here build on frozen bidirectional language models (BiLM) and show that such an approach provides a stronger and cheaper alternative for zero-shot VideoQA. In particular, (i) we combine visual inputs with the frozen BiLM using light trainable modules, (ii) we train such modules using Web-scraped multi-modal data, and finally (iii) we perform zero-shot VideoQA inference through masked language modeling, where the masked text is the answer to a given question. Our proposed approach, FrozenBiLM, outperforms the state of the art in zero-shot VideoQA by a significant margin on a variety of datasets, including LSMDC-FiB, iVQA, MSRVTT-QA, MSVD-QA, ActivityNet-QA, TGIF-FrameQA, How2QA and TVQA. It also demonstrates competitive performance in the few-shot and fully-supervised setting. Our code and models will be made publicly available at https://antoyang.github.io/frozenbilm.html.
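Step (iii) of the abstract, answering through masked language modeling, can be illustrated with a small sketch. This is not the authors' implementation: the prompt template, the function names `build_prompt` and `pick_answer`, and the toy logits are all assumptions made for illustration. In a real system the logits at the masked position would come from the frozen BiLM conditioned on the video features; here they are hard-coded so the example runs on its own.

```python
import math


def build_prompt(question: str, mask_token: str = "[MASK]") -> str:
    """Format a question as a cloze prompt whose masked slot is the answer.

    The template is hypothetical; the abstract does not specify one.
    """
    return f"Question: {question} Answer: {mask_token}."


def pick_answer(mask_logits: dict, candidates: list) -> tuple:
    """Pick the candidate answer with the highest probability at the mask.

    mask_logits maps vocabulary tokens to scores at the masked position.
    Probabilities are a softmax restricted to the candidate answers.
    """
    scores = {c: mask_logits[c] for c in candidates if c in mask_logits}
    if not scores:
        raise ValueError("no candidate appears in the vocabulary")
    top = max(scores.values())
    # Subtract the maximum before exponentiating for numerical stability.
    exps = {c: math.exp(s - top) for c, s in scores.items()}
    total = sum(exps.values())
    best = max(exps, key=exps.get)
    return best, exps[best] / total


# Toy logits standing in for the frozen model's output (assumed values).
toy_logits = {"guitar": 2.0, "piano": 0.5, "dog": -1.0}
prompt = build_prompt("What instrument is the man playing?")
answer, prob = pick_answer(toy_logits, ["guitar", "piano"])
print(prompt)
print(answer, round(prob, 3))
```

Because the answer is read off a single masked slot, no answer-annotated training data is needed at inference time, which is what makes the setting zero-shot.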

arXiv information

Authors	Antoine Yang, Antoine Miech, Josef Sivic, Ivan Laptev, Cordelia Schmid
Published	2022-06-16 13:18:20+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
