LongVALE: Vision-Audio-Language-Event Benchmark Towards Time-Aware Omni-Modal Perception of Long Videos

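Summary

Despite remarkable progress in video understanding, most work is still limited to coarse-grained or vision-only video tasks.
However, real-world videos carry omni-modal information (vision, audio, and speech) along with a series of events that form a coherent storyline.
The lack of multi-modal video data with fine-grained event annotations, together with the high cost of manual labeling, is a major obstacle to comprehensive omni-modal video perception.
To address this gap, we propose an automatic pipeline consisting of high-quality multi-modal video filtering, semantically coherent omni-modal event boundary detection, and cross-modal correlation-aware event captioning.
In this way, we present LongVALE, the first Vision-Audio-Language Event understanding benchmark, comprising 105K omni-modal events with precise temporal boundaries and detailed relation-aware captions across 8.4K high-quality long videos.
Furthermore, we build a baseline that leverages LongVALE to enable video large language models (LLMs) to perform omni-modal, fine-grained temporal video understanding for the first time.
Extensive experiments demonstrate the effectiveness and great potential of LongVALE for advancing comprehensive multi-modal video understanding.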

Abstract (original)

Despite impressive advancements in video understanding, most efforts remain limited to coarse-grained or visual-only video tasks. However, real-world videos encompass omni-modal information (vision, audio, and speech) with a series of events forming a cohesive storyline. The lack of multi-modal video data with fine-grained event annotations and the high cost of manual labeling are major obstacles to comprehensive omni-modality video perception. To address this gap, we propose an automatic pipeline consisting of high-quality multi-modal video filtering, semantically coherent omni-modal event boundary detection, and cross-modal correlation-aware event captioning. In this way, we present LongVALE, the first-ever Vision-Audio-Language Event understanding benchmark comprising 105K omni-modal events with precise temporal boundaries and detailed relation-aware captions within 8.4K high-quality long videos. Further, we build a baseline that leverages LongVALE to enable video large language models (LLMs) for omni-modality fine-grained temporal video understanding for the first time. Extensive experiments demonstrate the effectiveness and great potential of LongVALE in advancing comprehensive multi-modal video understanding.

arXiv information

Authors	Tiantian Geng, Jinrui Zhang, Qingni Wang, Teng Wang, Jinming Duan, Feng Zheng
Published	2024-11-29 15:18:06+00:00
arXiv site	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google
