OWMM-Agent: Open World Mobile Manipulation With Multi-modal Agentic Data Synthesis

Summary

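Rapid progress in navigation, manipulation, and vision models has made mobile manipulators capable in many specialized tasks.
However, open-world mobile manipulation (OWMM) remains a challenge for two reasons: it requires generalization to open-ended instructions and environments, and it involves the systemic complexity of integrating high-level decision-making with low-level robot control based on both global scene understanding and the current agent state.
To address this complexity, we propose a novel multi-modal agent architecture that maintains multi-view scene frames and agent states for decision-making and controls the robot through function calling.
A second challenge is hallucination caused by domain shift.
To improve agent performance, we further introduce an agentic data synthesis pipeline for the OWMM task that adapts the VLM to our task domain through instruction fine-tuning.
We highlight our fine-tuned OWMM-VLM as the first dedicated foundation model for mobile manipulators that combines global scene understanding, robot state tracking, and multi-modal action generation in a unified model.
Through experiments, we demonstrate that our model achieves SOTA performance compared to other foundation models, including GPT-4o, and shows strong zero-shot generalization in the real world.
The project page is at https://github.com/HHYHRHY/OWMM-Agent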
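
The agent loop described in the summary (keep scene frames and agent state, decide, then act through a function call) can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the tool names `navigate_to`, `pick`, and `place`, the `AgentState` fields, the use of text captions in place of camera frames, and the `scripted_policy` function that stands in for the fine-tuned VLM. It is not the authors' implementation.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


@dataclass
class AgentState:
    # Hypothetical state fields; the paper's actual state representation may differ.
    location: str = "start"
    holding: Optional[str] = None
    history: List[str] = field(default_factory=list)


@dataclass
class FunctionCall:
    name: str
    argument: str


# Hypothetical robot tools. Each one only updates the symbolic state here;
# a real system would invoke navigation and manipulation controllers.
def navigate_to(state: AgentState, frame_id: str) -> None:
    state.location = frame_id


def pick(state: AgentState, obj: str) -> None:
    state.holding = obj


def place(state: AgentState, target: str) -> None:
    state.holding = None


TOOLS: Dict[str, Callable[[AgentState, str], None]] = {
    "navigate_to": navigate_to,
    "pick": pick,
    "place": place,
}


def scripted_policy(task: Dict[str, str], frames: Dict[str, str],
                    state: AgentState) -> Optional[FunctionCall]:
    """Stand-in for the VLM: choose the next function call, or None when done."""
    if state.history and state.history[-1].startswith("place("):
        return None  # the object was placed on the previous step
    # "Ground" the object by finding the scene frame whose caption mentions it.
    source = next((fid for fid, caption in frames.items()
                   if task["object"] in caption), None)
    if source is None:
        return None  # object not visible in any stored frame
    if state.holding is None:
        if state.location != source:
            return FunctionCall("navigate_to", source)
        return FunctionCall("pick", task["object"])
    if state.location != task["target"]:
        return FunctionCall("navigate_to", task["target"])
    return FunctionCall("place", task["target"])


def run_episode(task: Dict[str, str], frames: Dict[str, str],
                policy=scripted_policy, max_steps: int = 10) -> AgentState:
    """Decide from frames and state, dispatch the call, record it, repeat."""
    state = AgentState()
    for _ in range(max_steps):
        call = policy(task, frames, state)
        if call is None:
            break
        TOOLS[call.name](state, call.argument)
        state.history.append(f"{call.name}({call.argument})")
    return state


frames = {
    "kitchen_counter": "a counter with an apple and a kettle",
    "living_room_table": "an empty coffee table",
}
final_state = run_episode({"object": "apple", "target": "living_room_table"}, frames)
print(final_state.history)
```

In this sketch the policy is the only component that would be replaced by a learned model; the dispatch table keeps the low-level control behind a fixed set of callable functions, which is the general idea behind controlling a robot by function calling.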

Summary (original)

The rapid progress of navigation, manipulation, and vision models has made mobile manipulators capable in many specialized tasks. However, the open-world mobile manipulation (OWMM) task remains a challenge due to the need for generalization to open-ended instructions and environments, as well as the systematic complexity to integrate high-level decision making with low-level robot control based on both global scene understanding and current agent state. To address this complexity, we propose a novel multi-modal agent architecture that maintains multi-view scene frames and agent states for decision-making and controls the robot by function calling. A second challenge is the hallucination from domain shift. To enhance the agent performance, we further introduce an agentic data synthesis pipeline for the OWMM task to adapt the VLM model to our task domain with instruction fine-tuning. We highlight our fine-tuned OWMM-VLM as the first dedicated foundation model for mobile manipulators with global scene understanding, robot state tracking, and multi-modal action generation in a unified model. Through experiments, we demonstrate that our model achieves SOTA performance compared to other foundation models including GPT-4o and strong zero-shot generalization in real world. The project page is at https://github.com/HHYHRHY/OWMM-Agent

arXiv information

Authors	Junting Chen, Haotian Liang, Lingxiao Du, Weiyun Wang, Mengkang Hu, Yao Mu, Wenhai Wang, Jifeng Dai, Ping Luo, Wenqi Shao, Lin Shao
Published	2025-06-04 17:57:44+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
