VLAP: Efficient Video-Language Alignment via Frame Prompting and Distilling for Video Question Answering

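Summary

In this work, we propose an efficient video-language alignment network based on frame prompting and distilling (VLAP).
The VLAP model addresses efficient frame sampling and effective cross-modal alignment in a unified way.
The network combines a new learnable, question-aware Frame-Prompter with a new cross-modal distillation module (QFormer-Distiller).
Pre-trained large image-language models have shown promising results on problems such as visual question answering.
However, sampling image frames efficiently and effectively when adapting such models to video-language alignment remains a major challenge.
Compared with prior work, VLAP selects key frames with critical content, improving video-language alignment accuracy while reducing inference latency (+3.3% on NExT-QA Temporal with a 3.0x speedup).
Overall, VLAP outperforms state-of-the-art methods on video question-answering benchmarks (e.g., +4.6% on STAR Interaction and +2.2% on STAR average with a 3.0x speedup; with 2 frames it outperforms 4-frame SeViLA on VLEP with a 4.2x speedup).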

Summary (original)

In this work, we propose an efficient Video-Language Alignment via Frame-Prompting and Distilling (VLAP) network. Our VLAP model addresses both efficient frame sampling and effective cross-modal alignment in a unified way. In our VLAP network, we design a new learnable question-aware Frame-Prompter together with a new cross-modal distillation (QFormer-Distiller) module. Pre-trained large image-language models have shown promising results on problems such as visual question answering. However, how to efficiently and effectively sample image frames when adapting pre-trained large image-language model to video-language alignment is still the major challenge. Compared with prior work, our VLAP model demonstrates the capability of selecting key frames with critical contents, thus improving the video-language alignment accuracy while reducing the inference latency (+3.3% on NExT-QA Temporal with 3.0X speed up). Overall, our VLAP network outperforms (e.g. +4.6% on STAR Interaction and +2.2% on STAR average with 3.0X speed up, ours 2-frames out-perform SeViLA 4-frames on VLEP with 4.2X speed up) the state-of-the-art methods on the video question-answering benchmarks.
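
The abstract says the Frame-Prompter is question-aware and selects key frames, but it gives no implementation details. As a rough illustration of that general idea only, the sketch below scores each frame embedding against a question embedding with cosine similarity and keeps the top k frames in temporal order. This is not the paper's learnable Frame-Prompter. The function names, the fixed similarity scoring, and the toy embeddings are all assumptions made for this example.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def select_key_frames(frame_embeddings, question_embedding, k):
    """Return indices of the k frames most similar to the question, in temporal order.

    Hypothetical helper: a fixed similarity score stands in for the learned,
    question-aware scoring that the abstract attributes to the Frame-Prompter.
    """
    scores = [cosine(f, question_embedding) for f in frame_embeddings]
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])  # restore temporal order of the chosen frames


# Toy 2-D embeddings, invented for illustration only.
frames = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7], [0.9, 0.1]]
question = [1.0, 0.0]
selected = select_key_frames(frames, question, k=2)
print(selected)
```

Passing fewer frames to the downstream model is where an inference speedup would come from in a scheme like this. The specific speedups quoted in the abstract are the authors' reported results and are not reproduced by this sketch.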

arXiv information

Authors	Xijun Wang, Junbang Liang, Chun-Kai Wang, Kenan Deng, Yu Lou, Ming Lin, Shan Yang
Published	2024-02-15 10:57:31+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
