Ego-R1: Chain-of-Tool-Thought for Ultra-Long Egocentric Video Reasoning

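Summary

We introduce Ego-R1, a new framework for reasoning over ultra-long (days to weeks) egocentric video. It relies on a structured Chain-of-Tool-Thought (CoTT) process orchestrated by the Ego-R1 Agent, which is trained with reinforcement learning (RL).
Inspired by human problem-solving strategies, CoTT decomposes complex reasoning into modular steps. At each step the RL agent invokes one specific tool, iteratively answering sub-questions that address tasks such as temporal retrieval and multimodal understanding.
We design a two-stage training paradigm, combining supervised fine-tuning (SFT) of a pretrained language model on CoTT data with RL, so that the agent can dynamically propose tools step by step for long-range reasoning.
To facilitate training, we construct a dataset called Ego-R1 Data, consisting of Ego-CoTT-25K for SFT and Ego-QA-4.4K for RL.
The Ego-R1 Agent is also evaluated on Ego-R1 Bench, a newly curated week-long video QA benchmark containing human-verified QA pairs from hybrid sources.
Extensive results show that the agent's dynamic, tool-augmented chain-of-thought reasoning effectively addresses the unique challenges of understanding ultra-long egocentric video, significantly extending time coverage from a few hours to a week.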

Abstract (original)

We introduce Ego-R1, a novel framework for reasoning over ultra-long (i.e., in days and weeks) egocentric videos, which leverages a structured Chain-of-Tool-Thought (CoTT) process, orchestrated by an Ego-R1 Agent trained via reinforcement learning (RL). Inspired by human problem-solving strategies, CoTT decomposes complex reasoning into modular steps, with the RL agent invoking specific tools, one per step, to iteratively and collaboratively answer sub-questions tackling such tasks as temporal retrieval and multi-modal understanding. We design a two-stage training paradigm involving supervised finetuning (SFT) of a pretrained language model using CoTT data and RL to enable our agent to dynamically propose step-by-step tools for long-range reasoning. To facilitate training, we construct a dataset called Ego-R1 Data, which consists of Ego-CoTT-25K for SFT and Ego-QA-4.4K for RL. Furthermore, our Ego-R1 agent is evaluated on a newly curated week-long video QA benchmark, Ego-R1 Bench, which contains human-verified QA pairs from hybrid sources. Extensive results demonstrate that the dynamic, tool-augmented chain-of-thought reasoning by our Ego-R1 Agent can effectively tackle the unique challenges of understanding ultra-long egocentric videos, significantly extending the time coverage from few hours to a week.
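The one-tool-per-step loop described in the abstract can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the tool names, the `Step` record, the `scripted_policy` stand-in for the trained agent, and the step budget are hypothetical and are not taken from the Ego-R1 implementation.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class Step:
    """One reasoning step: a thought, the single tool called, and its result."""
    thought: str
    tool: str
    query: str
    observation: str


# Hypothetical tools; real ones would search and inspect the video.
def temporal_retrieval(query: str) -> str:
    return f"[retrieved segments for: {query}]"


def visual_understanding(query: str) -> str:
    return f"[visual description for: {query}]"


TOOLS: Dict[str, Callable[[str], str]] = {
    "temporal_retrieval": temporal_retrieval,
    "visual_understanding": visual_understanding,
}

Policy = Callable[[str, List[Step]], Tuple[str, str, str]]


def scripted_policy(question: str, history: List[Step]) -> Tuple[str, str, str]:
    """Stand-in for the trained agent (assumption): returns (thought, tool, query).

    Returning the tool name "answer" ends the chain, with `query` as the answer.
    """
    if len(history) == 0:
        return ("Locate the relevant time span first.", "temporal_retrieval", question)
    if len(history) == 1:
        return ("Inspect the retrieved span.", "visual_understanding",
                history[-1].observation)
    return ("Enough evidence gathered.", "answer", history[-1].observation)


def run_cott(question: str, policy: Policy,
             tools: Dict[str, Callable[[str], str]],
             max_steps: int = 8) -> Tuple[str, List[Step]]:
    """Run the loop: one tool call per step until the policy answers."""
    history: List[Step] = []
    for _ in range(max_steps):
        thought, tool, query = policy(question, history)
        if tool == "answer":
            return query, history
        observation = tools[tool](query)  # exactly one tool call per step
        history.append(Step(thought, tool, query, observation))
    return "", history  # step budget exhausted without an answer


answer, trace = run_cott("Where did I leave my keys?", scripted_policy, TOOLS)
```

Here `trace` holds the sub-question chain (retrieval first, then visual inspection), and `answer` is whatever the policy returns once it decides it has enough evidence. In a trained system the scripted policy would be replaced by the language model's own step proposals.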

arXiv information

Authors	Shulin Tian, Ruiqi Wang, Hongming Guo, Penghao Wu, Yuhao Dong, Xiuying Wang, Jingkang Yang, Hao Zhang, Hongyuan Zhu, Ziwei Liu
Published	2025-06-16 16:17:08+00:00
arXiv page	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google
