End-to-End Video Question Answering with Frame Scoring Mechanisms and Adaptive Sampling

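Abstract

Video Question Answering (VideoQA) has emerged as a challenging frontier in multimedia processing, because it requires intricate interaction between the visual and textual modalities.
Uniformly sampling frames, or indiscriminately aggregating frame-level visual features, often fails to capture the subtle, question-relevant context of a video that VideoQA needs.
To mitigate these issues, we propose VidF4, a novel VideoQA framework equipped with a tailored frame selection strategy for effective and efficient VideoQA.
To evaluate the importance of each frame for a given question about a video, we propose three frame-scoring mechanisms that consider both question relevance and inter-frame similarity.
In addition, we design a differentiable adaptive frame sampling mechanism that enables end-to-end training of the frame selector and the answer generator.
Experimental results on three widely adopted benchmarks show that our model consistently outperforms existing VideoQA methods, establishing a new SOTA on NExT-QA (+0.3%), STAR (+0.9%), and TVQA (+1.0%).
Furthermore, we validate the effectiveness of each design choice through both quantitative and qualitative analyses.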
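
The abstract does not specify how the frame scores or the sampling step are computed, so the following is only an illustrative sketch of the general idea, not the VidF4 method. It assumes frames and the question are already embedded as plain vectors, scores each frame as question relevance minus a redundancy penalty (mean similarity to the other frames), and uses a temperature softmax as a simple differentiable stand-in for a hard frame pick. The function names, the `redundancy_weight` parameter, and the `temperature` value are all hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def score_frames(frames, question, redundancy_weight=0.5):
    """Hypothetical score: question relevance minus weighted mean similarity to other frames."""
    scores = []
    for i, frame in enumerate(frames):
        relevance = cosine(frame, question)
        others = [cosine(frame, g) for j, g in enumerate(frames) if j != i]
        redundancy = sum(others) / len(others) if others else 0.0
        scores.append(relevance - redundancy_weight * redundancy)
    return scores


def soft_select(scores, temperature=0.1):
    """Temperature softmax over scores: a smooth relaxation of picking the top frame."""
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp((s - top) / temperature) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Toy example: two near-duplicate frames and one frame aligned with the question.
frames = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
question = [0.0, 1.0]
scores = score_frames(frames, question)
weights = soft_select(scores)
best = max(range(len(weights)), key=weights.__getitem__)
```

In this toy example the third frame receives almost all of the weight, because it matches the question and is dissimilar to the other two. Lowering `temperature` pushes the weights toward a hard selection, while raising it spreads weight across frames; in an autodiff framework the same softmax would let gradients flow from the answer loss back into the scores.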

Abstract (original)

Video Question Answering (VideoQA) has emerged as a challenging frontier in the field of multimedia processing, requiring intricate interactions between visual and textual modalities. Simply uniformly sampling frames or indiscriminately aggregating frame-level visual features often falls short in capturing the nuanced and relevant contexts of videos to well perform VideoQA. To mitigate these issues, we propose VidF4, a novel VideoQA framework equipped with tailored frame selection strategy for effective and efficient VideoQA. We propose three frame-scoring mechanisms that consider both question relevance and inter-frame similarity to evaluate the importance of each frame for a given question on the video. Furthermore, we design a differentiable adaptive frame sampling mechanism to facilitate end-to-end training for the frame selector and answer generator. The experimental results across three widely adopted benchmarks demonstrate that our model consistently outperforms existing VideoQA methods, establishing a new SOTA across NExT-QA (+0.3%), STAR (+0.9%), and TVQA (+1.0%). Furthermore, through both quantitative and qualitative analyses, we validate the effectiveness of each design choice.

arXiv information

Authors	Jianxin Liang, Xiaojun Meng, Yueqian Wang, Chang Liu, Qun Liu, Dongyan Zhao
Published	2024-07-23 14:56:22+00:00
arXiv site	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google
