Video-R1: Reinforcing Video Reasoning in MLLMs

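Abstract

Inspired by DeepSeek-R1's success in eliciting reasoning abilities through rule-based reinforcement learning (RL), we introduce Video-R1, the first attempt to systematically explore the R1 paradigm for eliciting video reasoning in multimodal large language models (MLLMs).
However, directly applying RL training with the GRPO algorithm to video reasoning presents two main challenges: (i) a lack of temporal modeling for video reasoning, and (ii) the scarcity of high-quality video-reasoning data.
To address these issues, we first propose the T-GRPO algorithm, which encourages the model to use temporal information in videos for reasoning.
In addition, rather than relying on video data alone, we incorporate high-quality image-reasoning data into the training process.
We constructed two datasets, Video-R1-COT-165k for the SFT cold start and Video-R1-260k for RL training, both containing image and video data.
Experimental results show that Video-R1 achieves significant improvements on video reasoning benchmarks such as VideoMMMU and VSI-Bench, as well as on general video benchmarks including MVBench and TempCompass.
All code, models, and data are released.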

Abstract (original)

Inspired by DeepSeek-R1’s success in eliciting reasoning abilities through rule-based reinforcement learning (RL), we introduce Video-R1 as the first attempt to systematically explore the R1 paradigm for eliciting video reasoning within multimodal large language models (MLLMs). However, directly applying RL training with the GRPO algorithm to video reasoning presents two primary challenges: (i) a lack of temporal modeling for video reasoning, and (ii) the scarcity of high-quality video-reasoning data. To address these issues, we first propose the T-GRPO algorithm, which encourages models to utilize temporal information in videos for reasoning. Additionally, instead of relying solely on video data, we incorporate high-quality image-reasoning data into the training process. We have constructed two datasets: Video-R1-COT-165k for SFT cold start and Video-R1-260k for RL training, both comprising image and video data. Experimental results demonstrate that Video-R1 achieves significant improvements on video reasoning benchmarks such as VideoMMMU and VSI-Bench, as well as on general video benchmarks including MVBench and TempCompass, etc. Notably, Video-R1-7B attains a 35.8% accuracy on video spatial reasoning benchmark VSI-bench, surpassing the commercial proprietary model GPT-4o. All codes, models, data are released.
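
As a rough illustration of the group-relative scoring that GRPO-style training relies on, the sketch below normalizes each sampled response's reward against the mean and standard deviation of its own group. The abstract does not describe how T-GRPO rewards temporal information, so the second function is only a hypothetical stand-in: it assumes the same question is answered once with frames in order and once with frames shuffled, and adds a bonus to correct ordered answers only when the ordered group scores higher. The function names, the ordered-versus-shuffled comparison, and the bonus value of 0.3 are all assumptions made for this example, not the authors' implementation.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group of sampled responses.

    Each advantage is (reward - group mean) / (group std + eps).
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def with_temporal_bonus(ordered_rewards, shuffled_rewards, bonus=0.3):
    """Hypothetical temporal reward (an assumption, not the paper's rule).

    If responses generated from ordered frames score higher on average
    than responses generated from shuffled frames, add `bonus` to every
    positively rewarded ordered response; otherwise change nothing.
    """
    if mean(ordered_rewards) <= mean(shuffled_rewards):
        return list(ordered_rewards)
    return [r + bonus if r > 0 else r for r in ordered_rewards]


if __name__ == "__main__":
    ordered = [1.0, 0.0, 1.0, 0.0]   # rule-based rewards, frames in order
    shuffled = [0.0, 0.0, 1.0, 0.0]  # same question, frames shuffled
    shaped = with_temporal_bonus(ordered, shuffled)
    print(group_relative_advantages(shaped))
```

Normalizing within the group means a response is judged only relative to its siblings for the same prompt, which is why no separate value model appears in the sketch.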

arXiv information

Authors	Kaituo Feng, Kaixiong Gong, Bohao Li, Zonghao Guo, Yibing Wang, Tianshuo Peng, Benyou Wang, Xiangyu Yue
Published	2025-03-27 17:59:51+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
