Wolf: Dense Video Captioning with a World Summarization Framework

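Summary

We propose Wolf, a WOrLd summarization Framework for accurate video captioning.
Wolf is an automated captioning framework that adopts a mixture-of-experts approach, leveraging the complementary strengths of Vision Language Models (VLMs).
By using both image and video models, the framework captures different levels of information and summarizes them efficiently.
The approach can be applied to enhance video understanding, auto-labeling, and captioning.
To evaluate caption quality, we introduce CapScore, an LLM-based metric that assesses the similarity and quality of generated captions against ground-truth captions.
To facilitate comprehensive comparisons, we also build four human-annotated datasets in three domains: autonomous driving, general scenes, and robotics.
We show that Wolf achieves superior captioning performance compared with state-of-the-art approaches from the research community (VILA1.5, CogAgent) and commercial solutions (Gemini-Pro-1.5, GPT-4V).
For instance, compared with GPT-4V, Wolf improves CapScore by 55.6% in quality and by 77.4% in similarity on challenging driving videos.
Finally, we establish a benchmark for video captioning and introduce a leaderboard, aiming to accelerate advances in video understanding, captioning, and data alignment.
Webpage: https://wolfv0.github.io/.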

Abstract (original)

We propose Wolf, a WOrLd summarization Framework for accurate video captioning. Wolf is an automated captioning framework that adopts a mixture-of-experts approach, leveraging complementary strengths of Vision Language Models (VLMs). By utilizing both image and video models, our framework captures different levels of information and summarizes them efficiently. Our approach can be applied to enhance video understanding, auto-labeling, and captioning. To evaluate caption quality, we introduce CapScore, an LLM-based metric to assess the similarity and quality of generated captions compared to the ground truth captions. We further build four human-annotated datasets in three domains: autonomous driving, general scenes, and robotics, to facilitate comprehensive comparisons. We show that Wolf achieves superior captioning performance compared to state-of-the-art approaches from the research community (VILA1.5, CogAgent) and commercial solutions (Gemini-Pro-1.5, GPT-4V). For instance, in comparison with GPT-4V, Wolf improves CapScore both quality-wise by 55.6% and similarity-wise by 77.4% on challenging driving videos. Finally, we establish a benchmark for video captioning and introduce a leaderboard, aiming to accelerate advancements in video understanding, captioning, and data alignment. Webpage: https://wolfv0.github.io/.

arXiv information

Authors	Boyi Li, Ligeng Zhu, Ran Tian, Shuhan Tan, Yuxiao Chen, Yao Lu, Yin Cui, Sushant Veer, Max Ehrlich, Jonah Philion, Xinshuo Weng, Fuzhao Xue, Linxi Fan, Yuke Zhu, Jan Kautz, Andrew Tao, Ming-Yu Liu, Sanja Fidler, Boris Ivanovic, Trevor Darrell, Jitendra Malik, Song Han, Marco Pavone
Published	2025-03-20 17:56:05+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
