Fostering Video Reasoning via Next-Event Prediction

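Summary

Next-token prediction serves as the foundational learning task that enables reasoning in LLMs.
But what should the learning task be when the goal is to equip MLLMs with temporal reasoning capabilities over video inputs?
Existing tasks such as video question answering often rely on annotations from humans or much stronger MLLMs, while video captioning tends to entangle temporal reasoning with spatial information.
To address this gap, we propose next-event prediction (NEP), a learning task that uses future video segments as a rich, self-supervised signal to foster temporal reasoning.
We segment each video into past and future frames: the MLLM takes the past frames as input and predicts a summary of events derived from the future frames, which encourages the model to reason temporally in order to complete the task.
To support this task, we curate V1-33K, a dataset of 33,000 automatically extracted video segments spanning diverse real-world scenarios.
We further explore a range of video instruction-tuning strategies to study their effects on temporal reasoning.
To evaluate progress, we introduce FutureBench, which assesses coherence in predicting unseen future events.
Experiments validate that NEP offers a scalable and effective training paradigm for fostering temporal reasoning in MLLMs.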
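
The past/future split behind NEP can be sketched in a few lines. This is only an illustration under stated assumptions, not the authors' pipeline: the function names, the `NEPExample` container, the `split_ratio` parameter, and the `summarize_future` callback are all hypothetical, and the abstract does not say where the split point falls or how the future-event summary is produced.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple


def split_past_future(frames: Sequence, split_ratio: float = 0.5) -> Tuple[List, List]:
    """Split an ordered frame sequence into (past, future).

    split_ratio is an assumed knob; the source does not specify the split point.
    """
    if not 0.0 < split_ratio < 1.0:
        raise ValueError("split_ratio must be strictly between 0 and 1")
    if len(frames) < 2:
        raise ValueError("need at least two frames to form past and future parts")
    # Clamp the cut index so that both parts are non-empty.
    cut = min(max(1, int(len(frames) * split_ratio)), len(frames) - 1)
    return list(frames[:cut]), list(frames[cut:])


@dataclass
class NEPExample:
    """Hypothetical container for one training example."""
    past_frames: List        # model input
    target_summary: str      # supervision derived only from the future frames


def build_nep_example(
    frames: Sequence,
    summarize_future: Callable[[List], str],
    split_ratio: float = 0.5,
) -> NEPExample:
    """Build one example: past frames in, summary of future events as target.

    summarize_future is a placeholder for whatever produces the event summary.
    """
    past, future = split_past_future(frames, split_ratio)
    return NEPExample(past_frames=past, target_summary=summarize_future(future))


if __name__ == "__main__":
    # Integers stand in for decoded video frames.
    frames = list(range(10))
    example = build_nep_example(frames, lambda f: f"{len(f)} future frames")
    print(example.past_frames, "->", example.target_summary)
```

The point of the sketch is that the target is computed from frames the model never sees as input, so no human annotation of the past segment is needed.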

Summary (original)

Next-token prediction serves as the foundational learning task enabling reasoning in LLMs. But what should the learning task be when aiming to equip MLLMs with temporal reasoning capabilities over video inputs? Existing tasks such as video question answering often rely on annotations from humans or much stronger MLLMs, while video captioning tends to entangle temporal reasoning with spatial information. To address this gap, we propose next-event prediction (NEP), a learning task that harnesses future video segments as a rich, self-supervised signal to foster temporal reasoning. We segment each video into past and future frames: the MLLM takes the past frames as input and predicts a summary of events derived from the future frames, thereby encouraging the model to reason temporally in order to complete the task. To support this task, we curate V1-33K, a dataset comprising 33,000 automatically extracted video segments spanning diverse real-world scenarios. We further explore a range of video instruction-tuning strategies to study their effects on temporal reasoning. To evaluate progress, we introduce FutureBench to assess coherence in predicting unseen future events. Experiments validate that NEP offers a scalable and effective training paradigm for fostering temporal reasoning in MLLMs.

arXiv information

Authors	Haonan Wang, Hongfu Liu, Xiangyan Liu, Chao Du, Kenji Kawaguchi, Ye Wang, Tianyu Pang
Published	2025-05-28 15:13:34+00:00
arXiv page	arxiv_id (pdf)

Source and services used

arxiv.jp, Google
