Elysium: Exploring Object-level Perception in Videos via MLLM

Summary

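Multimodal large language models (MLLMs) have shown that they can perceive objects in still images, but their application to video tasks such as object tracking has received little study.
This gap stems mainly from two key challenges.
First, equipping an MLLM to perceive objects across multiple frames and to understand the relationships between frames requires extensive pretraining on large-scale video datasets.
Second, processing many frames within the context window of a large language model (LLM) can impose a substantial computational burden.
To address the first challenge, we introduce ElysiumTrack-1M, a large-scale video dataset paired with two new tasks: Referring Single Object Tracking (RSOT) and Video Referring Expression Generation (Video-REG).
ElysiumTrack-1M contains 1.27 million annotated video frames, each with corresponding object boxes and descriptions.
Using this dataset, we train MLLMs and propose T-Selector, a token-compression model, to address the second challenge.
Our proposed approach, Elysium: Exploring Object-level Perception in Videos via MLLM, is an end-to-end trainable MLLM and a first attempt at performing object-level tasks in videos without any additional plug-ins or expert models.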

Summary (original)

Multi-modal Large Language Models (MLLMs) have demonstrated their ability to perceive objects in still images, but their application in video-related tasks, such as object tracking, remains understudied. This lack of exploration is primarily due to two key challenges. Firstly, extensive pretraining on large-scale video datasets is required to equip MLLMs with the capability to perceive objects across multiple frames and understand inter-frame relationships. Secondly, processing a large number of frames within the context window of Large Language Models (LLMs) can impose a significant computational burden. To address the first challenge, we introduce ElysiumTrack-1M, a large-scale video dataset paired with novel tasks: Referring Single Object Tracking (RSOT) and Video Referring Expression Generation (Video-REG). ElysiumTrack-1M contains 1.27 million annotated video frames with corresponding object boxes and descriptions. Leveraging this dataset, we conduct training of MLLMs and propose a token-compression model T-Selector to tackle the second challenge. Our proposed approach, Elysium: Exploring Object-level Perception in Videos via MLLM, is an end-to-end trainable MLLM that makes the first attempt to conduct object-level tasks in videos without requiring any additional plug-in or expert models.

arXiv information

Authors	Han Wang, Yanjie Wang, Yongjie Ye, Yuxiang Nie, Can Huang
Published	2024-03-25 09:17:15+00:00
arXiv site	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google
