EgoVLM: Policy Optimization for Egocentric Video Understanding

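Abstract

Emerging embodied AI applications such as wearable cameras and autonomous agents highlight the need for robust reasoning over first-person video streams. We introduce EgoVLM, a vision-language model designed specifically to integrate visual understanding and spatio-temporal reasoning in egocentric video contexts. EgoVLM is fine-tuned with Group Relative Policy Optimization (GRPO), a reinforcement learning method adapted to align the model's outputs with human-like reasoning steps. Following the DeepSeek R1-Zero approach, we tune directly with RL, without a supervised fine-tuning phase on chain-of-thought (CoT) data. We evaluate EgoVLM on egocentric video question-answering benchmarks and show that domain-specific training substantially improves performance over general-purpose VLMs. EgoVLM-3B, trained only on non-CoT egocentric data, exceeds the Qwen2.5-VL 3B and 7B base models by 14.33 and 13.87 accuracy points, respectively, on the EgoSchema benchmark. By explicitly generating reasoning traces, EgoVLM improves interpretability, which makes it well suited to downstream applications. We also introduce a novel keyframe-based reward that incorporates salient frame selection to guide the reinforcement learning optimization. This reward formulation opens a promising avenue for future research on temporally grounded egocentric reasoning.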

Abstract (original)

Emerging embodied AI applications, such as wearable cameras and autonomous agents, have underscored the need for robust reasoning from first person video streams. We introduce EgoVLM, a vision-language model specifically designed to integrate visual comprehension and spatial-temporal reasoning within egocentric video contexts. EgoVLM is fine-tuned via Group Relative Policy Optimization (GRPO), a reinforcement learning method adapted to align model outputs with human-like reasoning steps. Following DeepSeek R1-Zero’s approach, we directly tune using RL without any supervised fine-tuning phase on chain-of-thought (CoT) data. We evaluate EgoVLM on egocentric video question answering benchmarks and show that domain-specific training substantially improves performance over general-purpose VLMs. Our EgoVLM-3B, trained exclusively on non-CoT egocentric data, outperforms the base Qwen2.5-VL 3B and 7B models by 14.33 and 13.87 accuracy points on the EgoSchema benchmark, respectively. By explicitly generating reasoning traces, EgoVLM enhances interpretability, making it well-suited for downstream applications. Furthermore, we introduce a novel keyframe-based reward that incorporates salient frame selection to guide reinforcement learning optimization. This reward formulation opens a promising avenue for future exploration in temporally grounded egocentric reasoning.
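The abstract names GRPO but does not spell out its mechanics. As a rough illustration only, the group-relative step commonly associated with GRPO scores each sampled answer against the other answers drawn for the same question, rather than against a learned value function. The sketch below is not the authors' implementation: the function name, the reward values, the use of the population standard deviation, and the small eps term are all assumptions made for illustration.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against its own sampling group.

    rewards: scalar rewards for several answers sampled for ONE question.
    eps: small constant (an assumption) that avoids division by zero
         when every answer in the group receives the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; the choice is an assumption
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four sampled answers to a single video question:
# 1.0 for a correct answer, 0.0 for an incorrect one.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)

# A group where every answer scores the same carries no learning signal.
flat = group_relative_advantages([1.0, 1.0, 1.0])
```

Answers that beat their group's average get a positive advantage and the rest get a negative one, so the policy update favors relatively better answers without requiring CoT supervision. How the keyframe-based reward enters the reward value is not described in the abstract and is left out here.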

arXiv information

Authors	Ashwin Vinod, Shrey Pandit, Aditya Vavre, Linshen Liu
Published	2025-06-03 17:28:00+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, DeepL
