Embodied-Reasoner: Synergizing Visual Search, Reasoning, and Action for Embodied Interactive Tasks

Summary

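Recent advances in deep thinking models have demonstrated remarkable reasoning capabilities on mathematical and coding tasks.
However, their effectiveness in embodied domains, which require continuous interaction with environments through image-action interleaved trajectories, remains largely unexplored.
We present Embodied Reasoner, a model that extends o1-style reasoning to interactive embodied search tasks.
Unlike mathematical reasoning, which relies primarily on logical deduction, embodied scenarios demand spatial understanding, temporal reasoning, and ongoing self-reflection based on interaction history.
To address these challenges, we synthesize 9.3k coherent Observation-Thought-Action trajectories containing 64k interactive images and 90k diverse thinking processes (analysis, spatial reasoning, reflection, planning, and verification).
We develop a three-stage training pipeline that progressively enhances the model's capabilities through imitation learning, self-exploration via rejection sampling, and self-correction through reflection tuning.
The evaluation shows that our model significantly outperforms advanced visual reasoning models; for example, it exceeds OpenAI o1, o3-mini, and Claude-3.7 by +9%, 24%, and +13%.
Analysis reveals that our model exhibits fewer repeated searches and logical inconsistencies, with particular advantages in complex long-horizon tasks.
Real-world environments also show our model's superiority, again with fewer repeated searches and fewer cases of logical inconsistency.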
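
The self-exploration stage of the pipeline is described only as "rejection sampling", so the following is a minimal sketch of that general idea and not the authors' implementation: sample several rollouts per task from the current policy and keep only the successful ones as further training data. The `Trajectory` container, the `rejection_sample` function, and the toy policy are all hypothetical names introduced for illustration.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Trajectory:
    """One rollout. Hypothetical container, not the paper's data format."""
    task: str
    steps: List[str]   # interleaved thoughts and actions, as plain strings
    success: bool      # whether the episode completed the task


def rejection_sample(
    tasks: List[str],
    rollout: Callable[[str], Trajectory],
    samples_per_task: int = 4,
) -> List[Trajectory]:
    """Sample several rollouts per task and keep only the successful ones."""
    kept: List[Trajectory] = []
    for task in tasks:
        for _ in range(samples_per_task):
            trajectory = rollout(task)
            if trajectory.success:  # failed attempts are rejected
                kept.append(trajectory)
    return kept


def make_toy_rollout() -> Callable[[str], Trajectory]:
    """Stand-in policy that succeeds on every second call (illustrative only)."""
    calls = {"n": 0}

    def rollout(task: str) -> Trajectory:
        calls["n"] += 1
        ok = calls["n"] % 2 == 0
        steps = ["think: plan search", "act: navigate", "act: pick up"]
        return Trajectory(task=task, steps=steps, success=ok)

    return rollout


if __name__ == "__main__":
    data = rejection_sample(["find the keys", "heat the mug"], make_toy_rollout())
    for t in data:
        print(t.task, len(t.steps))
```

In a real setting the `rollout` callable would run the model in a simulator and the success flag would come from the environment's task checker; both are assumptions here.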

Summary (original)

Recent advances in deep thinking models have demonstrated remarkable reasoning capabilities on mathematical and coding tasks. However, their effectiveness in embodied domains which require continuous interaction with environments through image-action interleaved trajectories remains largely unexplored. We present Embodied Reasoner, a model that extends o1-style reasoning to interactive embodied search tasks. Unlike mathematical reasoning that relies primarily on logical deduction, embodied scenarios demand spatial understanding, temporal reasoning, and ongoing self-reflection based on interaction history. To address these challenges, we synthesize 9.3k coherent Observation-Thought-Action trajectories containing 64k interactive images and 90k diverse thinking processes (analysis, spatial reasoning, reflection, planning, and verification). We develop a three-stage training pipeline that progressively enhances the model's capabilities through imitation learning, self-exploration via rejection sampling, and self-correction through reflection tuning. The evaluation shows that our model significantly outperforms those advanced visual reasoning models, e.g., it exceeds OpenAI o1, o3-mini, and Claude-3.7 by +9%, 24%, and +13%. Analysis reveals our model exhibits fewer repeated searches and logical inconsistencies, with particular advantages in complex long-horizon tasks. Real-world environments also show our superiority while exhibiting fewer repeated searches and logical inconsistency cases.

arXiv information

Authors	Wenqi Zhang, Mengna Wang, Gangao Liu, Xu Huixin, Yiwei Jiang, Yongliang Shen, Guiyang Hou, Zhe Zheng, Hang Zhang, Xin Li, Weiming Lu, Peng Li, Yueting Zhuang
Published	2025-03-27 17:00:51+00:00
arXiv site	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google
