LongVALE: Vision-Audio-Language-Event Benchmark Towards Time-Aware Omni-Modal Perception of Long Videos

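Summary

Despite impressive progress in video understanding, most work remains limited to coarse-grained or visual-only video tasks.
Real-world videos, however, carry omni-modal information (vision, audio, and speech) in which a series of events forms a coherent storyline.
The lack of multi-modal video data with fine-grained event annotations, together with the high cost of manual labeling, is a major obstacle to comprehensive omni-modal video perception.
To address this gap, we propose an automatic pipeline consisting of high-quality multi-modal video filtering, semantically coherent omni-modal event boundary detection, and cross-modal correlation-aware event captioning.
With this pipeline, we present LongVALE, the first Vision-Audio-Language Event understanding benchmark, comprising 105K omni-modal events with precise temporal boundaries and detailed relation-aware captions within 8.4K high-quality long videos.
Furthermore, we build a baseline that leverages LongVALE to enable, for the first time, video large language models (LLMs) to perform omni-modal fine-grained temporal video understanding.
Extensive experiments demonstrate the effectiveness and great potential of LongVALE for advancing comprehensive multi-modal video understanding.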

Summary (original)

Despite impressive advancements in video understanding, most efforts remain limited to coarse-grained or visual-only video tasks. However, real-world videos encompass omni-modal information (vision, audio, and speech) with a series of events forming a cohesive storyline. The lack of multi-modal video data with fine-grained event annotations and the high cost of manual labeling are major obstacles to comprehensive omni-modality video perception. To address this gap, we propose an automatic pipeline consisting of high-quality multi-modal video filtering, semantically coherent omni-modal event boundary detection, and cross-modal correlation-aware event captioning. In this way, we present LongVALE, the first-ever Vision-Audio-Language Event understanding benchmark comprising 105K omni-modal events with precise temporal boundaries and detailed relation-aware captions within 8.4K high-quality long videos. Further, we build a baseline that leverages LongVALE to enable video large language models (LLMs) for omni-modality fine-grained temporal video understanding for the first time. Extensive experiments demonstrate the effectiveness and great potential of LongVALE in advancing comprehensive multi-modal video understanding.
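
As a rough illustration of what an event with "precise temporal boundaries" and a caption might look like as data, here is a minimal Python sketch. The class name `OmniModalEvent`, its field names, and the `is_valid_timeline` helper are hypothetical and are not taken from the LongVALE release; the no-overlap check is likewise an assumption made only for illustration. The final line uses the rounded figures quoted in the abstract.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OmniModalEvent:
    """Hypothetical record for one annotated event (names are illustrative)."""

    start_s: float  # event start time, in seconds
    end_s: float    # event end time, in seconds
    caption: str    # caption relating the visual, audio, and speech content

    def duration(self) -> float:
        """Length of the event in seconds."""
        return self.end_s - self.start_s


def is_valid_timeline(events: list[OmniModalEvent]) -> bool:
    """Return True if every event has positive length and the events are
    listed in temporal order without overlapping (an assumed convention)."""
    if any(ev.end_s <= ev.start_s for ev in events):
        return False
    return all(a.end_s <= b.start_s for a, b in zip(events, events[1:]))


# Rounded figures from the abstract: about 105K events in about 8.4K videos.
avg_events_per_video = 105_000 / 8_400
```

With the rounded counts from the abstract, this works out to roughly 12.5 events per video on average; the exact figure depends on the unrounded dataset statistics.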

arXiv information

Authors	Tiantian Geng, Jinrui Zhang, Qingni Wang, Teng Wang, Jinming Duan, Feng Zheng
Published	2025-03-20 11:55:30+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
