VideoChat-R1: Enhancing Spatio-Temporal Perception via Reinforcement Fine-Tuning

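Abstract

Recent advances in reinforcement learning have substantially improved the reasoning capabilities of multimodal large language models (MLLMs).
Approaches such as Group Relative Policy Optimization (GRPO) and rule-based reward mechanisms show promise in the text and image domains, but their application to video understanding remains limited.
This paper presents a systematic exploration of Reinforcement Fine-Tuning (RFT) with GRPO for video MLLMs, aiming to enhance spatio-temporal perception while maintaining general capabilities.
Our experiments show that RFT is highly data-efficient for task-specific improvements.
Through multi-task RFT on spatio-temporal perception objectives with limited samples, we develop VideoChat-R1, a powerful video MLLM that achieves state-of-the-art performance on spatio-temporal perception tasks without sacrificing chat ability.
Compared with Qwen2.5-VL-7B, VideoChat-R1 improves performance several-fold on tasks such as temporal grounding (+31.8) and object tracking (+31.2).
Additionally, it significantly improves on general QA benchmarks such as VideoMME (+0.9), MVBench (+1.0), and Perception Test (+0.9).
Our findings underscore the potential of RFT for specialized task enhancement of video MLLMs.
We hope our work offers valuable insights for future RL research on video MLLMs.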

Abstract (original)

Recent advancements in reinforcement learning have significantly advanced the reasoning capabilities of multimodal large language models (MLLMs). While approaches such as Group Relative Policy Optimization (GRPO) and rule-based reward mechanisms demonstrate promise in text and image domains, their application to video understanding remains limited. This paper presents a systematic exploration of Reinforcement Fine-Tuning (RFT) with GRPO for video MLLMs, aiming to enhance spatio-temporal perception while maintaining general capabilities. Our experiments reveal that RFT is highly data-efficient for task-specific improvements. Through multi-task RFT on spatio-temporal perception objectives with limited samples, we develop VideoChat-R1, a powerful video MLLM that achieves state-of-the-art performance on spatio-temporal perception tasks without sacrificing chat ability, while exhibiting emerging spatio-temporal reasoning abilities. Compared to Qwen2.5-VL-7B, VideoChat-R1 boosts performance several-fold in tasks like temporal grounding (+31.8) and object tracking (+31.2). Additionally, it significantly improves on general QA benchmarks such as VideoMME (+0.9), MVBench (+1.0), and Perception Test (+0.9). Our findings underscore the potential of RFT for specialized task enhancement of Video MLLMs. We hope our work offers valuable insights for future RL research in video MLLMs.
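The abstract names GRPO and rule-based rewards without describing them. As a rough illustration only, the sketch below pairs a temporal IoU reward (a common rule-based score for temporal grounding) with GRPO's group-relative advantage, in which each sampled answer's reward is normalized against the mean and standard deviation of its own group. The function names, the choice of IoU as the reward, and the example intervals are assumptions made for illustration; none of them are taken from the paper.

```python
from statistics import mean, pstdev


def temporal_iou(pred, target):
    """IoU between two (start, end) intervals, e.g. in seconds."""
    inter = max(0.0, min(pred[1], target[1]) - max(pred[0], target[0]))
    union = (pred[1] - pred[0]) + (target[1] - target[0]) - inter
    return inter / union if union > 0 else 0.0


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward against the mean and std of its own group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps keeps the division finite when every sample gets the same reward.
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical group: four sampled answers to one grounding query.
target = (10.0, 20.0)
predictions = [(10.0, 20.0), (12.0, 22.0), (0.0, 5.0), (15.0, 30.0)]

rewards = [temporal_iou(p, target) for p in predictions]
advantages = group_relative_advantages(rewards)
```

Because the advantage is computed relative to the group, no separate value model is needed: answers that overlap the target better than their siblings receive positive advantages, and the rest receive negative ones.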

arXiv information

Authors	Xinhao Li, Ziang Yan, Desen Meng, Lu Dong, Xiangyu Zeng, Yinan He, Yali Wang, Yu Qiao, Yi Wang, Limin Wang
Published	2025-04-10 16:28:39+00:00

Source and services used

arxiv.jp, Google
