Transferring Foundation Models for Generalizable Robotic Manipulation

Abstract

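Improving the generalization of general-purpose robotic manipulation agents in the real world has long been a significant challenge.
Existing approaches often rely on collecting large-scale robot data, such as the RT-1 dataset, which is costly and time-consuming.
Because that data is still not diverse enough, these approaches typically have limited capability in open-domain scenarios with new objects and diverse environments.
In this paper, we propose a new paradigm that uses language-reasoning segmentation masks, generated by internet-scale foundation models, to condition robot manipulation tasks.
The mask modality carries semantic, geometric, and temporal correlation priors derived from vision foundation models. By integrating it into the end-to-end policy model, our approach can perceive object pose effectively and robustly and enables sample-efficient generalization learning, covering new object instances, semantic categories, and unseen backgrounds.
First, we introduce a series of foundation models to ground natural-language demands across multiple tasks.
Second, we develop a two-stream 2D policy model based on imitation learning, which processes raw images and object masks to predict robot actions in a local-global perception manner.
Extensive real-world experiments on a Franka Emika robot arm demonstrate the effectiveness of the proposed paradigm and policy architecture.
Demos are included in the submitted video, and more comprehensive ones are available at link1 or link2.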

Abstract (original)

Improving the generalization capabilities of general-purpose robotic manipulation agents in the real world has long been a significant challenge. Existing approaches often rely on collecting large-scale robotic data which is costly and time-consuming, such as the RT-1 dataset. However, due to insufficient diversity of data, these approaches typically suffer from limiting their capability in open-domain scenarios with new objects and diverse environments. In this paper, we propose a novel paradigm that effectively leverages language-reasoning segmentation mask generated by internet-scale foundation models, to condition robot manipulation tasks. By integrating the mask modality, which incorporates semantic, geometric, and temporal correlation priors derived from vision foundation models, into the end-to-end policy model, our approach can effectively and robustly perceive object pose and enable sample-efficient generalization learning, including new object instances, semantic categories, and unseen backgrounds. We first introduce a series of foundation models to ground natural language demands across multiple tasks. Secondly, we develop a two-stream 2D policy model based on imitation learning, which processes raw images and object masks to predict robot actions with a local-global perception manner. Extensive real-world experiments conducted on a Franka Emika robot arm demonstrate the effectiveness of our proposed paradigm and policy architecture. Demos can be found in our submitted video, and more comprehensive ones can be found in link1 or link2.

arXiv information

Authors	Jiange Yang, Wenhui Tan, Chuhao Jin, Keling Yao, Bei Liu, Jianlong Fu, Ruihua Song, Gangshan Wu, Limin Wang
Published	2025-02-07 14:58:32+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
