ViLA: Efficient Video-Language Alignment for Video Question Answering

Summary

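In this work, we propose an efficient Video-Language Alignment (ViLA) network.
Our ViLA model addresses both efficient frame sampling and effective cross-modal alignment in a unified way.
In the ViLA network, we design a new learnable text-guided Frame-Prompter together with a new cross-modal distillation (QFormer-Distiller) module.
Pre-trained large image-language models have shown promising results on problems such as visual question answering (VQA).
However, how to sample video frames efficiently and effectively when adapting a pre-trained large image-language model to video-language alignment remains a major challenge.
Compared with prior work, our ViLA model demonstrates the ability to select key frames with critical content, improving video-language alignment accuracy while reducing inference latency (+3.3% on NExT-QA Temporal with a 3.0x speedup).
Overall, our ViLA network outperforms state-of-the-art methods on video question-answering benchmarks: +4.6% on STAR Interaction and +2.2% on STAR average with a 3.0x speedup, and our 2-frame model outperforms the 4-frame SeViLA on the VLEP dataset with a 4.2x speedup.
The code will be available at https://github.com/xijun-cs/ViLA.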

Summary (original)

In this work, we propose an efficient Video-Language Alignment (ViLA) network. Our ViLA model addresses both efficient frame sampling and effective cross-modal alignment in a unified way. In our ViLA network, we design a new learnable text-guided Frame-Prompter together with a new cross-modal distillation (QFormer-Distiller) module. Pre-trained large image-language models have shown promising results on problems such as visual question answering (VQA). However, how to efficiently and effectively sample video frames when adapting a pre-trained large image-language model to video-language alignment is still a major challenge. Compared with prior work, our ViLA model demonstrates the capability of selecting key frames with critical contents, thus improving the video-language alignment accuracy while reducing the inference latency (+3.3% on NExT-QA Temporal with 3.0X speed-up). Overall, our ViLA network outperforms the state-of-the-art methods on the video question-answering benchmarks: +4.6% on STAR Interaction and +2.2% on STAR average with 3.0X speed-up, and our 2-frame model outperforms the 4-frame SeViLA on the VLEP dataset with 4.2X speed-up. The code will be available at https://github.com/xijun-cs/ViLA.
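
As a rough illustration of what text-guided key-frame selection means, the sketch below scores each frame embedding against a question embedding by cosine similarity and keeps the top k frames in temporal order. This is not the ViLA Frame-Prompter, which the abstract describes as learnable; the function names, the toy embeddings, and the similarity-based scoring are all assumptions made for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def select_key_frames(frame_embeddings, text_embedding, k):
    """Return indices of the k frames most similar to the text, in temporal order.

    Hypothetical helper: a fixed similarity heuristic standing in for a
    learned, text-conditioned frame selector.
    """
    scores = [cosine(frame, text_embedding) for frame in frame_embeddings]
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


# Toy 2-D embeddings (made up): four frames and one question vector.
frames = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7], [0.9, 0.1]]
question = [1.0, 0.0]
selected = select_key_frames(frames, question, k=2)
print(selected)
```

Passing fewer frames to the downstream model is what would reduce inference cost in such a setup; how the selection is trained and distilled in ViLA itself is not specified in the abstract.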

arXiv information

Authors	Xijun Wang, Junbang Liang, Chun-Kai Wang, Kenan Deng, Yu Lou, Ming Lin, Shan Yang
Published	2024-10-01 10:11:14+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
