LLM4VG: Large Language Models Evaluation for Video Grounding

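Summary

Recently, researchers have attempted to investigate the capability of LLMs in handling videos and have proposed several video LLM models.
However, the ability of LLMs to handle video grounding (VG), an important time-related video task that requires the model to precisely locate the start and end timestamps of the temporal moment in a video that matches a given text query, remains unclear and unexplored in the literature.
To fill this gap, this paper proposes the LLM4VG benchmark, which systematically evaluates the performance of different LLMs on video grounding tasks.
Based on the proposed LLM4VG, we design extensive experiments to examine two groups of video LLM models on video grounding: (i) video LLMs trained on text-video pairs (denoted VidLLM), and (ii) LLMs combined with pretrained visual description models such as video/image captioning models.
We propose prompt methods that integrate the VG instruction with descriptions from different kinds of generators, including caption-based generators for direct visual description and VQA-based generators for information enhancement.
We also provide comprehensive comparisons of various VidLLMs and explore the influence of different choices of visual models, LLMs, prompt designs, and so on.
Our experimental evaluations lead to two conclusions: (i) existing VidLLMs are still far from achieving satisfactory video grounding performance, and more time-related video tasks should be included when further fine-tuning these models; (ii) the combination of LLMs and visual models shows preliminary video grounding ability, with considerable potential for improvement through more reliable models and further guidance from prompt instructions.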

Summary (original)

Recently, researchers have attempted to investigate the capability of LLMs in handling videos and proposed several video LLM models. However, the ability of LLMs to handle video grounding (VG), which is an important time-related video task requiring the model to precisely locate the start and end timestamps of temporal moments in videos that match the given textual queries, still remains unclear and unexplored in literature. To fill the gap, in this paper, we propose the LLM4VG benchmark, which systematically evaluates the performance of different LLMs on video grounding tasks. Based on our proposed LLM4VG, we design extensive experiments to examine two groups of video LLM models on video grounding: (i) the video LLMs trained on the text-video pairs (denoted as VidLLM), and (ii) the LLMs combined with pretrained visual description models such as the video/image captioning model. We propose prompt methods to integrate the instruction of VG and description from different kinds of generators, including caption-based generators for direct visual description and VQA-based generators for information enhancement. We also provide comprehensive comparisons of various VidLLMs and explore the influence of different choices of visual models, LLMs, prompt designs, etc, as well. Our experimental evaluations lead to two conclusions: (i) the existing VidLLMs are still far away from achieving satisfactory video grounding performance, and more time-related video tasks should be included to further fine-tune these models, and (ii) the combination of LLMs and visual models shows preliminary abilities for video grounding with considerable potential for improvement by resorting to more reliable models and further guidance of prompt instructions.

arXiv information

Authors	Wei Feng, Xin Wang, Hong Chen, Zeyang Zhang, Zihan Song, Yuwei Zhou, Wenwu Zhu
Published	2023-12-28 13:02:31+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
