Guiding Long-Horizon Task and Motion Planning with Vision Language Models




Vision-Language Models (VLMs) can generate plausible high-level plans when prompted with a goal, the context, an image of the scene, and any planning constraints. However, there is no guarantee that the predicted actions are geometrically and kinematically feasible for a particular robot embodiment. As a result, their plans often omit prerequisite steps, such as opening drawers to access objects. Robot task and motion planners can generate motion trajectories that respect the geometric feasibility of actions and insert physically necessary actions, but they do not scale to everyday problems that require common-sense knowledge and involve large state spaces composed of many variables. We propose VLM-TAMP, a hierarchical planning algorithm that leverages a VLM to generate both semantically meaningful and horizon-reducing intermediate subgoals that guide a task and motion planner. When a subgoal or action cannot be refined, the VLM is queried again for replanning. We evaluate VLM-TAMP on kitchen tasks where a robot must accomplish cooking goals that require performing 30-50 actions in sequence and interacting with up to 21 objects. VLM-TAMP substantially outperforms baselines that rigidly and independently execute VLM-generated action sequences, both in terms of success rates (50 to 100% versus 0%) and average task completion percentage (72 to 100% versus 15 to 45%). See project site for more information.
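
The replanning loop described above can be sketched as follows. This is a minimal illustration of the control flow only, not the authors' implementation: `plan_with_subgoals`, `toy_proposer`, and `toy_refiner` are hypothetical names, the proposer stands in for a VLM query, and the refiner stands in for a task and motion planner. The toy does not model the planner inserting prerequisite actions on its own.

```python
def plan_with_subgoals(goal, propose_subgoals, refine, max_queries=3):
    """Ask a proposer for subgoals, refine each into actions, re-query on failure."""
    plan = []    # actions committed so far
    failed = []  # subgoals that could not be refined, reported back on re-query
    for _ in range(max_queries):
        subgoals = propose_subgoals(goal, plan, failed)
        for subgoal in subgoals:
            actions = refine(subgoal, plan)
            if actions is None:
                # Refinement failed: remember the subgoal and query the proposer again.
                failed.append(subgoal)
                break
            plan.extend(actions)
        else:
            # The inner loop finished without a failure: every subgoal was refined.
            return plan
    return None  # query budget exhausted


def toy_proposer(goal, plan, failed):
    # Hypothetical stand-in for a VLM query; the returned strings are made up.
    if "pick pot" in failed:
        return ["open drawer", "pick pot"]
    return ["pick pot"]


def toy_refiner(subgoal, plan):
    # Hypothetical stand-in for a task and motion planner.
    # The pot is assumed unreachable until the drawer has been opened.
    if subgoal == "pick pot" and "open drawer" not in plan:
        return None
    return [subgoal]


result = plan_with_subgoals("boil water", toy_proposer, toy_refiner)
print(result)
```

In this toy run the first proposal cannot be refined, so the proposer is queried again with the failed subgoal, and the second proposal succeeds.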


Authors: Zhutian Yang, Caelan Garrett, Dieter Fox, Tomás Lozano-Pérez, Leslie Pack Kaelbling
Published: 2024-10-03 04:14:21+00:00
Capturing complex hand movements and object interactions using machine learning-powered stretchable smart textile gloves




Accurate real-time tracking of dexterous hand movements and interactions has numerous applications in human-computer interaction, the metaverse, robotics, and tele-health. Capturing realistic hand movements is challenging because of the large number of articulations and degrees of freedom. Here, we report accurate and dynamic tracking of articulated hand and finger movements using stretchable, washable smart gloves with embedded helical sensor yarns and inertial measurement units. The sensor yarns have a high dynamic range, responding to strains from as low as 0.005% to as high as 155%, and show stability during extensive use and washing cycles. We use multi-stage machine learning to report average joint angle estimation root mean square errors of 1.21 and 1.45 degrees for intra- and inter-subject cross-validation, respectively, matching the accuracy of costly motion capture cameras without occlusion or field of view limitations. We report a data augmentation technique that enhances robustness to noise and variations of sensors. We demonstrate accurate tracking of dexterous hand movements during object interactions, opening new avenues of applications including accurate typing on a mock paper keyboard, recognition of complex dynamic and static gestures adapted from American Sign Language, and object identification.
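
For readers unfamiliar with the metric, the root mean square error quoted above can be computed as in the sketch below. The function name `joint_angle_rmse` and the sample values are hypothetical and chosen only for illustration; they are not data from the paper, and the paper's averaging over joints and subjects is not reproduced here.

```python
import math


def joint_angle_rmse(predicted_deg, reference_deg):
    """Root mean square error between two equal-length angle sequences, in degrees."""
    if len(predicted_deg) != len(reference_deg) or not predicted_deg:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(p - r) ** 2 for p, r in zip(predicted_deg, reference_deg)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Made-up joint angles: each prediction is off by exactly one degree.
predicted = [10.0, 21.0, 29.0, 41.0]
reference = [11.0, 20.0, 30.0, 40.0]
rmse = joint_angle_rmse(predicted, reference)
print(rmse)
```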


Authors: Arvin Tashakori, Zenan Jiang, Amir Servati, Saeid Soltanian, Harishkumar Narayana, Katherine Le, Caroline Nakayama, Chieh-ling Yang, Z. Jane Wang, Janice J. Eng, Peyman Servati
Published: 2024-10-03 05:32:16+00:00
Reinforcement Learning with Foundation Priors: Let the Embodied Agent Efficiently Learn on Its Own




Reinforcement learning (RL) is a promising approach for solving robotic manipulation tasks. However, it is challenging to apply RL algorithms directly in the real world. For one thing, RL is data-intensive and typically requires millions of interactions with environments, which is impractical in real scenarios. For another, designing reward functions manually requires heavy engineering effort. To address these issues, we leverage foundation models in this paper. We propose Reinforcement Learning with Foundation Priors (RLFP) to utilize guidance and feedback from policy, value, and success-reward foundation models. Within this framework, we introduce the Foundation-guided Actor-Critic (FAC) algorithm, which enables embodied agents to explore more efficiently with automatic reward functions. The benefits of our framework are threefold: (1) sample efficient; (2) minimal and effective reward engineering; (3) agnostic to foundation model forms and robust to noisy priors. Our method achieves remarkable performance in various manipulation tasks on both real robots and in simulation. Across 5 dexterous tasks with real robots, FAC achieves an average success rate of 86% after one hour of real-time learning. Across 8 tasks in the simulated Meta-world, FAC achieves 100% success rates in 7/8 tasks in fewer than 100k frames (about 1 hour of training), outperforming baseline methods with manually designed rewards in 1M frames. We believe the RLFP framework can enable future robots to explore and learn autonomously in the physical world for more tasks.


Authors: Weirui Ye, Yunsheng Zhang, Haoyang Weng, Xianfan Gu, Shengjie Wang, Tong Zhang, Mengchen Wang, Pieter Abbeel, Yang Gao
Published: 2024-10-03 05:57:42+00:00
End-to-end Driving in High-Interaction Traffic Scenarios with Reinforcement Learning




Dynamic and interactive traffic scenarios pose significant challenges for autonomous driving systems. Reinforcement learning (RL) offers a promising approach by enabling the exploration of driving policies beyond the constraints of pre-collected datasets and predefined conditions, particularly in complex environments. However, a critical challenge lies in effectively extracting spatial and temporal features from sequences of high-dimensional, multi-modal observations while minimizing the accumulation of errors over time. Additionally, efficiently guiding large-scale RL models to converge on optimal driving policies without frequent failures during the training process remains difficult. We propose an end-to-end model-based RL algorithm named Ramble to address these issues. Ramble processes multi-view RGB images and LiDAR point clouds into low-dimensional latent features to capture the context of traffic scenarios at each time step. A transformer-based architecture is then employed to model temporal dependencies and predict future states. By learning a dynamics model of the environment, Ramble can foresee upcoming traffic events and make more informed, strategic decisions. Our implementation demonstrates that prior experience in feature extraction and decision-making plays a pivotal role in accelerating the convergence of RL models toward optimal driving policies. Ramble achieves state-of-the-art performance in route completion rate and driving score on the CARLA Leaderboard 2.0, showcasing its effectiveness in managing complex and dynamic traffic situations.


Authors: Yueyuan Li, Mingyang Jiang, Songan Zhang, Wei Yuan, Chunxiang Wang, Ming Yang
Published: 2024-10-03 06:45:59+00:00
Semantic Communication and Control Co-Design for Multi-Objective Correlated Dynamics




This letter introduces a machine-learning approach to learning the semantic dynamics of correlated systems with different control rules and dynamics. By leveraging the Koopman operator in an autoencoder (AE) framework, the system’s state evolution is linearized in the latent space using a dynamic semantic Koopman (DSK) model, capturing the baseline semantic dynamics. Signal temporal logic (STL) is incorporated through a logical semantic Koopman (LSK) model to encode system-specific control rules. These models form the proposed logical Koopman AE framework that reduces communication costs while improving state prediction accuracy and control performance, showing a 91.65% reduction in communication samples and significant performance gains in simulation.
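
The idea of linear state evolution in a latent space can be illustrated with a small sketch. It assumes a latent state z that advances as z[t+1] = K z[t] for a fixed matrix K. The encoder and decoder of the autoencoder are omitted, and the names `mat_vec` and `rollout` and the matrix values are hypothetical, not taken from the letter.

```python
def mat_vec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def rollout(koopman, z0, steps):
    """Predict latent states by repeatedly applying the linear operator."""
    states = [list(z0)]
    for _ in range(steps):
        states.append(mat_vec(koopman, states[-1]))
    return states


# Made-up 2x2 latent operator: one mode decays, the other grows.
K = [[0.5, 0.0],
     [0.0, 2.0]]
trajectory = rollout(K, [4.0, 1.0], 2)
print(trajectory)
```

Because each prediction is a single matrix-vector product, a receiver holding K could in principle roll the latent state forward locally between transmitted samples. Whether the letter's communication savings arise in exactly this way is an assumption here.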


Authors: Abanoub M. Girgis, Hyowoon Seo, Mehdi Bennis
Published: 2024-10-03 08:38:54+00:00
QDGset: A Large Scale Grasping Dataset Generated with Quality-Diversity




Recent advances in AI have led to significant results in robotic learning, but skills like grasping remain only partially solved. Many recent works exploit synthetic grasping datasets to learn to grasp unknown objects. However, those datasets were generated with simple grasp sampling methods that rely on priors. Recently, Quality-Diversity (QD) algorithms have been shown to make grasp sampling significantly more efficient. In this work, we extend QDG-6DoF, a QD framework for generating object-centric grasps, to scale up the production of synthetic grasping datasets. We propose a data augmentation method that combines the transformation of object meshes with transfer learning from previous grasping repertoires. Our experiments show that this approach reduces the number of required evaluations per discovered robust grasp by up to 20%. We used this approach to generate QDGset, a dataset of 6DoF grasp poses that contains about 3.5 and 4.5 times more grasps and objects, respectively, than the previous state of the art. Our method allows anyone to easily generate data, eventually contributing to a large-scale collaborative dataset of synthetic grasps.


Authors: Johann Huber, François Hélénon, Mathilde Kappel, Ignacio de Loyola Páez-Ubieta, Santiago T. Puente, Pablo Gil, Faïz Ben Amar, Stéphane Doncieux
Published: 2024-10-03 08:56:14+00:00
Data Optimisation of Machine Learning Models for Smart Irrigation in Urban Parks




Urban environments face significant challenges due to climate change, including extreme heat, drought, and water scarcity, which impact public health, community well-being, and local economies. Effective management of these issues is crucial, particularly in areas like Sydney Olympic Park, which relies on one of Australia’s largest irrigation systems. The Smart Irrigation Management for Parks and Cool Towns (SIMPaCT) project, initiated in 2021, leverages advanced technologies and machine learning models to optimize irrigation and induce physical cooling. This paper introduces two novel methods to enhance the efficiency of the SIMPaCT system’s extensive sensor network and applied machine learning models. The first method employs clustering of sensor time series data using K-shape and K-means algorithms to estimate readings from missing sensors, ensuring continuous and reliable data. This approach can detect anomalies, correct data sources, and identify and remove redundant sensors to reduce maintenance costs. The second method involves sequential data collection from different sensor locations using robotic systems, significantly reducing the need for high numbers of stationary sensors. Together, these methods aim to maintain accurate soil moisture predictions while optimizing sensor deployment and reducing maintenance costs, thereby enhancing the efficiency and effectiveness of the smart irrigation system. Our evaluations demonstrate significant improvements in the efficiency and cost-effectiveness of soil moisture monitoring networks. The cluster-based replacement of missing sensors yields up to a 5.4% decrease in average error. Sequential sensor data collection, emulating a robotic system, shows 17.2% and 2.1% decreases in average error for circular and linear paths, respectively.
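
One simple way to picture cluster-based estimation of a missing sensor is to average the current readings of the other sensors in its cluster. The sketch below assumes cluster labels are already available (for example from K-means on historical time series) and is not necessarily how the paper computes its estimates. The function name `estimate_missing`, the sensor IDs, and the values are hypothetical.

```python
def estimate_missing(readings, clusters, missing_sensor):
    """Estimate a missing reading as the mean of available peers in the same cluster."""
    cluster_id = clusters[missing_sensor]
    peers = [
        readings[sensor]
        for sensor, label in clusters.items()
        if label == cluster_id and sensor != missing_sensor and sensor in readings
    ]
    if not peers:
        raise ValueError("no available sensors in the same cluster")
    return sum(peers) / len(peers)


# Made-up soil moisture readings; sensor "s3" has dropped out.
readings = {"s1": 0.30, "s2": 0.34, "s4": 0.10}
# Assumed output of a prior clustering step: s1, s2, s3 behave alike.
clusters = {"s1": 0, "s2": 0, "s3": 0, "s4": 1}
estimate = estimate_missing(readings, clusters, "s3")
print(estimate)
```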


Authors: Nasser Ghadiri, Bahman Javadi, Oliver Obst, Sebastian Pfautsch
Published: 2024-10-03 09:42:16+00:00
Coastal Underwater Evidence Search System with Surface-Underwater Collaboration




The coastal underwater evidence search system with surface-underwater collaboration is designed to revolutionize the search for artificial objects in coastal underwater environments, overcoming limitations associated with traditional methods such as divers and tethered remotely operated vehicles. Our innovative multi-robot collaborative system consists of three parts: an autonomous surface vehicle serving as a mission control center, a towed underwater vehicle for wide-area search, and a biomimetic underwater robot inspired by marine organisms for detailed inspections of identified areas. We conduct extensive simulations and real-world experiments in pond environments and coastal fields to demonstrate the system's potential to surpass the limitations of conventional underwater search methods, offering a robust and efficient solution for law enforcement and recovery operations in marine settings.


Authors: Hin Wang Lin, Pengyu Wang, Zhaohua Yang, Ka Chun Leung, Fangming Bao, Ka Yu Kui, Jian Xiang Erik Xu, Ling Shi
Published: 2024-10-03 09:57:19+00:00
Diffusion Meets Options: Hierarchical Generative Skill Composition for Temporally-Extended Tasks




Safe and successful deployment of robots requires not only the ability to generate complex plans but also the capacity to frequently replan and correct execution errors. This paper addresses the challenge of long-horizon trajectory planning under temporally extended objectives in a receding horizon manner. To this end, we propose DOPPLER, a data-driven hierarchical framework that generates and updates plans based on instructions specified by linear temporal logic (LTL). Our method decomposes temporal tasks into a chain of options with hierarchical reinforcement learning from offline non-expert datasets. It leverages diffusion models to generate options with low-level actions. We devise a determinantal-guided posterior sampling technique during batch generation, which improves the speed and diversity of diffusion-generated options, leading to more efficient querying. Experiments on robot navigation and manipulation tasks demonstrate that DOPPLER can generate sequences of trajectories that progressively satisfy the specified formulae for obstacle avoidance and sequential visitation. Demonstration videos are available online at:


Authors: Zeyu Feng, Hao Luan, Kevin Yuchen Ma, Harold Soh
Published: 2024-10-03 11:10:37+00:00
RiEMann: Near Real-Time SE(3)-Equivariant Robot Manipulation without Point Cloud Segmentation




We present RiEMann, an end-to-end near Real-time SE(3)-Equivariant Robot Manipulation imitation learning framework from scene point cloud input. Compared to previous methods that rely on descriptor field matching, RiEMann directly predicts the target poses of objects for manipulation without any object segmentation. RiEMann learns a manipulation task from scratch with 5 to 10 demonstrations, generalizes to unseen SE(3) transformations and instances of target objects, resists visual interference of distracting objects, and follows the near real-time pose change of the target object. The scalable action space of RiEMann facilitates the addition of custom equivariant actions such as the direction of turning the faucet, which makes articulated object manipulation possible for RiEMann. In simulation and real-world 6-DOF robot manipulation experiments, we test RiEMann on 5 categories of manipulation tasks with a total of 25 variants and show that RiEMann outperforms baselines in both task success rates and SE(3) geodesic distance errors on predicted poses (reduced by 68.6%), and achieves a 5.4 frames per second (FPS) network inference speed. Code and video results are available at
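
The abstract does not spell out how the SE(3) geodesic distance error is computed. As an assumption, the sketch below shows only the standard geodesic distance between two rotations, which is the angle of the relative rotation: arccos((trace(R_a^T R_b) - 1) / 2). Any translational part of the error is left out, and the helper names `rotation_geodesic_distance` and `rot_z` are hypothetical.

```python
import math


def rotation_geodesic_distance(r_a, r_b):
    """Angle in radians of the relative rotation r_a^T r_b for 3x3 rotation matrices."""
    # trace(r_a^T r_b) equals the sum of elementwise products of the two matrices.
    trace = sum(r_a[i][j] * r_b[i][j] for i in range(3) for j in range(3))
    # Clamp to [-1, 1] to guard against floating-point round-off before acos.
    cos_angle = max(-1.0, min(1.0, (trace - 1.0) / 2.0))
    return math.acos(cos_angle)


def rot_z(theta):
    """Rotation by theta radians about the z axis."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]


# A predicted pose rotated a quarter turn away from the reference pose.
reference = rot_z(0.0)
predicted = rot_z(math.pi / 2)
angle = rotation_geodesic_distance(reference, predicted)
print(math.degrees(angle))
```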


Authors: Chongkai Gao, Zhengrong Xue, Shuying Deng, Tianhai Liang, Siqi Yang, Lin Shao, Huazhe Xu
Published: 2024-10-03 11:13:29+00:00
