KeyVideoLLM: Towards Large-scale Video Keyframe Selection

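Abstract

With the recent rise of web video, managing and understanding large-scale video datasets has become increasingly important.
Video Large Language Models (VideoLLMs) have emerged in recent years on the strength of their video understanding capabilities.
However, training and inference for VideoLLMs require vast amounts of data, which poses serious data management challenges, particularly in efficiency, robustness, and effectiveness.
This work presents KeyVideoLLM, a keyframe selection method based on text-video frame similarity, designed to manage VideoLLM data efficiently, robustly, and effectively.
Specifically, KeyVideoLLM achieves a data compression rate of up to 60.9x, substantially reducing disk space requirements and demonstrating its efficiency.
It also maintains a 100% selection success rate across all video formats and scales, speeds up processing by up to 200x compared with existing keyframe selection methods, and requires no hyperparameter tuning.
Beyond efficiency and robustness, KeyVideoLLM further improves model performance on video question-answering tasks in both the training and inference stages.
Notably, it consistently achieves state-of-the-art (SoTA) results across diverse datasets.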

Abstract (original)

Recently, with the rise of web videos, managing and understanding large-scale video datasets has become increasingly important. Video Large Language Models (VideoLLMs) have emerged in recent years due to their strong video understanding capabilities. However, training and inference processes for VideoLLMs demand vast amounts of data, presenting significant challenges to data management, particularly regarding efficiency, robustness, and effectiveness. In this work, we present KeyVideoLLM, a text-video frame similarity-based keyframe selection method designed to manage VideoLLM data efficiently, robustly, and effectively. Specifically, KeyVideoLLM achieves a remarkable data compression rate of up to 60.9 times, substantially lowering disk space requirements, which proves its high efficiency. Additionally, it maintains a 100% selection success rate across all video formats and scales, enhances processing speed by up to 200 times compared to existing keyframe selection methods, and does not require hyperparameter tuning. Beyond its outstanding efficiency and robustness, KeyVideoLLM further improves model performance in video question-answering tasks during both training and inference stages. Notably, it consistently achieved the state-of-the-art (SoTA) experimental results on diverse datasets.
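The abstract names the core idea, keyframe selection by text-video frame similarity, but does not spell out the procedure. The following is a minimal sketch of one way such a selection could work, not the authors' implementation. It assumes that each frame and the accompanying text (for example, a question or caption) have already been embedded into a shared vector space by some text-image encoder. The function names `cosine_similarity` and `select_keyframes` and the parameter `k` are hypothetical and chosen for illustration only.

```python
import math
from typing import List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def select_keyframes(
    frame_embeddings: Sequence[Sequence[float]],
    text_embedding: Sequence[float],
    k: int,
) -> List[int]:
    """Return the indices of the k frames most similar to the text, in temporal order.

    Assumption: embeddings come from a shared text-image encoder; the
    encoder itself is outside the scope of this sketch.
    """
    if k <= 0:
        raise ValueError("k must be positive")
    scores = [cosine_similarity(f, text_embedding) for f in frame_embeddings]
    # Rank frame indices by descending similarity; ties keep their original order.
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    # Re-sort the chosen indices so the keyframes stay in playback order.
    return sorted(ranked[:k])


if __name__ == "__main__":
    # Toy 2-D embeddings standing in for real encoder outputs.
    frames = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [-1.0, 0.0]]
    text = [1.0, 0.0]
    print(select_keyframes(frames, text, k=2))  # prints [0, 2]
```

Keeping only `k` frames per video is what would yield a storage saving in a scheme like this; the 60.9x figure quoted in the abstract comes from the paper's own pipeline and cannot be derived from this sketch.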

arXiv information

Authors	Hao Liang, Jiapeng Li, Tianyi Bai, Xijie Huang, Linzhuang Sun, Zhengren Wang, Conghui He, Bin Cui, Chong Chen, Wentao Zhang
Published	2024-08-01 08:08:43+00:00
arXiv site	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google
