EventVL: Understand Event Streams via Multimodal Large Language Model

Summary

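Event-based vision-language models (VLMs) have recently made good progress on practical vision tasks.
However, most of these works merely use CLIP to address traditional perception tasks, which prevents the models from explicitly understanding the full semantics and context of event streams.
To address this deficiency, we propose EventVL, the first generative event-based MLLM (Multimodal Large Language Model) framework for explicit semantic understanding.
Specifically, to bridge the data gap in connecting the semantics of different modalities, we first annotate a large event-image/video-text dataset containing almost 1.4 million high-quality data pairs, which enables effective learning across various scenes, e.g., driving scenes or human motion.
We then design an Event Spatiotemporal Representation that fully explores comprehensive information by diversely aggregating and segmenting the event stream.
To further promote a compact semantic space, Dynamic Semantic Alignment is introduced to improve and complete the sparse semantic spaces of events.
Extensive experiments show that EventVL can significantly surpass existing MLLM baselines in event captioning and scene description generation tasks.
We hope our research can contribute to the development of the event vision community.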
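
The abstract does not spell out how the event stream is aggregated and segmented, so the following is only a minimal sketch of one common way to do it, not the authors' Event Spatiotemporal Representation. It assumes events are time-sorted `(x, y, timestamp, polarity)` tuples with polarity in {-1, +1}. The helper names `segment_events`, `aggregate_frame`, and `spatiotemporal_frames` are hypothetical. The sketch builds one frame from the whole stream and one frame per equal-duration temporal segment.

```python
from typing import List, Tuple

# Assumed event layout (not from the paper): (x, y, timestamp, polarity in {-1, +1})
Event = Tuple[int, int, float, int]
Frame = List[List[int]]


def segment_events(events: List[Event], num_segments: int) -> List[List[Event]]:
    """Split a time-sorted event list into equal-duration temporal segments."""
    segments: List[List[Event]] = [[] for _ in range(num_segments)]
    if not events:
        return segments
    t0, t1 = events[0][2], events[-1][2]
    span = max(t1 - t0, 1e-9)  # avoid division by zero for single-timestamp streams
    for ev in events:
        # Map the timestamp to a segment index; clamp the last event into range.
        idx = min(int((ev[2] - t0) / span * num_segments), num_segments - 1)
        segments[idx].append(ev)
    return segments


def aggregate_frame(events: List[Event], width: int, height: int) -> Frame:
    """Accumulate signed polarities per pixel into a 2D frame."""
    frame = [[0] * width for _ in range(height)]
    for x, y, _, polarity in events:
        frame[y][x] += polarity
    return frame


def spatiotemporal_frames(
    events: List[Event], width: int, height: int, num_segments: int
) -> Tuple[Frame, List[Frame]]:
    """Return one whole-stream frame plus one frame per temporal segment."""
    whole = aggregate_frame(events, width, height)
    parts = [
        aggregate_frame(seg, width, height)
        for seg in segment_events(events, num_segments)
    ]
    return whole, parts


if __name__ == "__main__":
    demo = [(0, 0, 0.0, 1), (1, 0, 0.4, -1), (0, 0, 0.6, 1), (1, 1, 1.0, 1)]
    whole, parts = spatiotemporal_frames(demo, width=2, height=2, num_segments=2)
    print(whole)
    print(parts)
```

In a pipeline of this kind, the whole-stream frame would carry global context while the per-segment frames preserve coarse temporal order. How EventVL actually combines such views is not stated in the abstract.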

Abstract (original)

The event-based Vision-Language Model (VLM) recently has made good progress for practical vision tasks. However, most of these works just utilize CLIP for focusing on traditional perception tasks, which obstruct model understanding explicitly the sufficient semantics and context from event streams. To address the deficiency, we propose EventVL, the first generative event-based MLLM (Multimodal Large Language Model) framework for explicit semantic understanding. Specifically, to bridge the data gap for connecting different modalities semantics, we first annotate a large event-image/video-text dataset, containing almost 1.4 million high-quality pairs of data, which enables effective learning across various scenes, e.g., drive scene or human motion. After that, we design Event Spatiotemporal Representation to fully explore the comprehensive information by diversely aggregating and segmenting the event stream. To further promote a compact semantic space, Dynamic Semantic Alignment is introduced to improve and complete sparse semantic spaces of events. Extensive experiments show that our EventVL can significantly surpass existing MLLM baselines in event captioning and scene description generation tasks. We hope our research could contribute to the development of the event vision community.

arXiv information

Authors	Pengteng Li, Yunfan Lu, Pinghao Song, Wuyang Li, Huizai Yao, Hui Xiong
Published	2025-01-23 14:37:21+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
