jarxiv | Japanese arxiv | Page 236

SafeVLA: Towards Safety Alignment of Vision-Language-Action Model via Constrained Learning

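Posted: June 3, 2025 | Author: jarxiv

Summary

Vision-language-action models (VLAs) show potential as generalist robot policies.
However, these models pose extreme safety challenges during real-world deployment, including the risk of harm to the environment, the robot itself, and humans.
How can safety constraints be explicitly integrated into VLAs?
We address this by exploring an integrated safety approach (ISA) that systematically models safety requirements, actively elicits diverse unsafe behaviors, effectively constrains VLA policies via safe reinforcement learning, and rigorously assures their safety through targeted evaluations.
Leveraging the constrained Markov decision process (CMDP) paradigm, ISA optimizes VLAs from a min-max perspective against elicited safety risks.
Policies aligned through this comprehensive approach therefore achieve the following key features: (i) effective safety-performance trade-offs, with an 83.58% safety improvement over the current state-of-the-art method while maintaining task performance (+3.85%).
(ii) Strong safety assurance, with the ability to mitigate long-tail risks and handle extreme failure scenarios.
(iii) Robust generalization of learned safety behaviors to various out-of-distribution perturbations.
Our data, models, and newly proposed benchmark environment are available at https://pku-safevla.github.io.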

Summary (original)

Vision-language-action models (VLAs) show potential as generalist robot policies. However, these models pose extreme safety challenges during real-world deployment, including the risk of harm to the environment, the robot itself, and humans. How can safety constraints be explicitly integrated into VLAs? We address this by exploring an integrated safety approach (ISA), systematically modeling safety requirements, then actively eliciting diverse unsafe behaviors, effectively constraining VLA policies via safe reinforcement learning, and rigorously assuring their safety through targeted evaluations. Leveraging the constrained Markov decision process (CMDP) paradigm, ISA optimizes VLAs from a min-max perspective against elicited safety risks. Thus, policies aligned through this comprehensive approach achieve the following key features: (I) effective safety-performance trade-offs, this exploration yields an 83.58% safety improvement compared to the current state-of-the-art method, while also maintaining task performance (+3.85%). (II) strong safety assurance, with the ability to mitigate long-tail risks and handle extreme failure scenarios. (III) robust generalization of learned safety behaviors to various out-of-distribution perturbations. Our data, models and newly proposed benchmark environment are available at https://pku-safevla.github.io.
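The abstract does not spell out the optimization problem, so the block below is only the textbook CMDP formulation and its Lagrangian min-max relaxation, given as a reading aid. The symbols (reward return J_r, cost return J_c, cost budget b, multiplier λ, policy parameters θ) are assumptions for illustration and may not match the paper's notation or its exact objective.

```latex
% Constrained MDP: maximize reward return subject to a cost budget.
\max_{\theta} \; J_r(\pi_\theta)
\quad \text{s.t.} \quad J_c(\pi_\theta) \le b,
\qquad
J_r(\pi_\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\Big[\sum_{t} \gamma^{t} r(s_t, a_t)\Big],
\quad
J_c(\pi_\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\Big[\sum_{t} \gamma^{t} c(s_t, a_t)\Big].

% Lagrangian relaxation: a min-max problem over the multiplier and the policy.
\min_{\lambda \ge 0} \; \max_{\theta} \;
\Big[ J_r(\pi_\theta) - \lambda \big( J_c(\pi_\theta) - b \big) \Big].
```

In this form the multiplier λ grows while the cost constraint is violated, which shifts the policy update toward reducing cost; once the constraint is satisfied, λ shrinks and the reward term dominates again.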

arXiv information

Authors	Borong Zhang, Yuhao Zhang, Jiaming Ji, Yingshan Lei, Josef Dai, Yuanpei Chen, Yaodong Yang
Published	2025-05-31 14:22:50+00:00
arXiv site	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google

Categories: cs.AI, cs.RO

Fully Onboard SLAM for Distributed Mapping with a Swarm of Nano-Drones

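Posted: June 3, 2025 | Author: jarxiv

Summary

The use of unmanned aerial vehicles (UAVs) is rapidly increasing in applications ranging from surveillance and first-aid missions to industrial automation involving cooperation with other machines or humans.
To maximize area coverage and reduce mission latency, swarms of collaborating drones have become a significant research direction.
However, this approach requires open challenges in positioning, mapping, and communications to be addressed.
This work describes a distributed mapping system based on a swarm of nano-UAVs, characterized by a limited payload of 35 g and tightly constrained onboard sensing and computing capabilities.
Each nano-UAV is equipped with four 64-pixel depth sensors that measure the relative distance to obstacles in four directions.
The proposed system merges the information from the swarm and generates a coherent grid map without relying on any external infrastructure.
Data fusion is performed using the iterative closest point algorithm and a graph-based simultaneous localization and mapping algorithm, running entirely onboard the UAV's low-power ARM Cortex-M microcontroller with just 192 kB of memory.
Field results gathered in three different mazes with a swarm of up to four nano-UAVs show a mapping accuracy of 12 cm and demonstrate that the mapping time is inversely proportional to the number of agents.
The proposed framework scales linearly in terms of communication bandwidth and onboard computational complexity, supporting communication between up to 20 nano-UAVs and mapping of areas up to 180 m², with the chosen configuration requiring only 50 kB of memory.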

Summary (original)

The use of Unmanned Aerial Vehicles (UAVs) is rapidly increasing in applications ranging from surveillance and first-aid missions to industrial automation involving cooperation with other machines or humans. To maximize area coverage and reduce mission latency, swarms of collaborating drones have become a significant research direction. However, this approach requires open challenges in positioning, mapping, and communications to be addressed. This work describes a distributed mapping system based on a swarm of nano-UAVs, characterized by a limited payload of 35 g and tightly constrained onboard sensing and computing capabilities. Each nano-UAV is equipped with four 64-pixel depth sensors that measure the relative distance to obstacles in four directions. The proposed system merges the information from the swarm and generates a coherent grid map without relying on any external infrastructure. The data fusion is performed using the iterative closest point algorithm and a graph-based simultaneous localization and mapping algorithm, running entirely onboard the UAV’s low-power ARM Cortex-M microcontroller with just 192 kB of memory. Field results gathered in three different mazes with a swarm of up to 4 nano-UAVs prove a mapping accuracy of 12 cm and demonstrate that the mapping time is inversely proportional to the number of agents. The proposed framework scales linearly in terms of communication bandwidth and onboard computational complexity, supporting communication between up to 20 nano-UAVs and mapping of areas up to 180 m2 with the chosen configuration requiring only 50 kB of memory.
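The abstract names the iterative closest point (ICP) algorithm as one of the two data-fusion components. As a rough illustration of what an ICP step does, here is a minimal 2D point-to-point sketch in plain Python. It is not the authors' onboard implementation: the function names, the brute-force matching, and the L-shaped test wall are all assumptions made for this example.

```python
import math

def best_rigid_2d(src, dst):
    """Least-squares rotation angle and translation mapping paired src points onto dst."""
    n = len(src)
    sx = sum(p[0] for p in src) / n
    sy = sum(p[1] for p in src) / n
    dx = sum(p[0] for p in dst) / n
    dy = sum(p[1] for p in dst) / n
    s_cos = 0.0
    s_sin = 0.0
    # Accumulate dot and cross products of the centered point pairs.
    for (ax, ay), (bx, by) in zip(src, dst):
        ax, ay, bx, by = ax - sx, ay - sy, bx - dx, by - dy
        s_cos += ax * bx + ay * by
        s_sin += ax * by - ay * bx
    theta = math.atan2(s_sin, s_cos)
    c, s = math.cos(theta), math.sin(theta)
    # Translation that maps the rotated source centroid onto the target centroid.
    tx = dx - (c * sx - s * sy)
    ty = dy - (s * sx + c * sy)
    return theta, tx, ty

def transform(points, theta, tx, ty):
    """Rotate points by theta, then translate by (tx, ty)."""
    c, s = math.cos(theta), math.sin(theta)
    return [(c * x - s * y + tx, s * x + c * y + ty) for x, y in points]

def icp_2d(src, dst, iterations=10):
    """Point-to-point ICP: match nearest neighbors, then re-estimate the rigid transform."""
    current = list(src)
    for _ in range(iterations):
        # Brute-force nearest-neighbor matching (fine for tiny scans only).
        matched = [
            min(dst, key=lambda q: (q[0] - p[0]) ** 2 + (q[1] - p[1]) ** 2)
            for p in current
        ]
        theta, tx, ty = best_rigid_2d(current, matched)
        current = transform(current, theta, tx, ty)
    return current

# Hypothetical L-shaped wall sampled every 0.5 m (illustrative data only).
target = [(0.5 * i, 0.0) for i in range(9)] + [(0.0, 0.5 * j) for j in range(1, 6)]
# The same wall as seen from a slightly rotated and shifted pose.
scan = transform(target, 0.1, 0.1, -0.05)
aligned = icp_2d(scan, target)
max_error = max(math.dist(p, q) for p, q in zip(aligned, target))
```

ICP only converges to the correct alignment when the initial pose error is small relative to the point spacing, as it is here. In a full SLAM pipeline, the relative poses estimated by scan matching would typically feed a pose graph as constraints.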

arXiv information

Authors	Carl Friess, Vlad Niculescu, Tommaso Polonelli, Michele Magno, Luca Benini
Published	2025-05-31 17:33:14+00:00
arXiv site	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google

Categories: cs.RO

Falcon: Fast Visuomotor Policies via Partial Denoising

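Posted: June 3, 2025 | Author: jarxiv

Summary

Diffusion policies are widely adopted in complex visuomotor tasks for their ability to capture multimodal action distributions.
However, the multiple sampling steps required for action generation significantly harm real-time inference efficiency, which limits their applicability in real-time decision-making scenarios.
Existing acceleration techniques either require retraining or degrade performance under low sampling steps.
Here we propose Falcon, which mitigates this speed-performance trade-off and achieves further acceleration.
The core insight is that visuomotor tasks exhibit sequential dependencies between actions.
Falcon leverages this by reusing partially denoised actions from historical information rather than sampling from Gaussian noise at each step.
By integrating current observations, Falcon reduces sampling steps while preserving performance.
Importantly, Falcon is a training-free algorithm that can be applied as a plug-in to further improve decision efficiency on top of existing acceleration techniques.
We validated Falcon in 48 simulated environments and 2 real-world robot experiments.
It demonstrates a 2-7x speedup with negligible performance degradation, offering a promising direction for efficient visuomotor policy design.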

Summary (original)

Diffusion policies are widely adopted in complex visuomotor tasks for their ability to capture multimodal action distributions. However, the multiple sampling steps required for action generation significantly harm real-time inference efficiency, which limits their applicability in real-time decision-making scenarios. Existing acceleration techniques either require retraining or degrade performance under low sampling steps. Here we propose Falcon, which mitigates this speed-performance trade-off and achieves further acceleration. The core insight is that visuomotor tasks exhibit sequential dependencies between actions. Falcon leverages this by reusing partially denoised actions from historical information rather than sampling from Gaussian noise at each step. By integrating current observations, Falcon reduces sampling steps while preserving performance. Importantly, Falcon is a training-free algorithm that can be applied as a plug-in to further improve decision efficiency on top of existing acceleration techniques. We validated Falcon in 48 simulated environments and 2 real-world robot experiments, demonstrating a 2-7x speedup with negligible performance degradation, offering a promising direction for efficient visuomotor policy design.

arXiv information

Authors	Haojun Chen, Minghao Liu, Chengdong Ma, Xiaojian Ma, Zailin Ma, Huimin Wu, Yuanpei Chen, Yifan Zhong, Mingzhi Wang, Qing Li, Yaodong Yang
Published	2025-05-31 18:44:43+00:00
arXiv site	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google

Categories: cs.RO

CogAD: Cognitive-Hierarchy Guided End-to-End Autonomous Driving

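Posted: June 3, 2025 | Author: jarxiv

Summary

While end-to-end autonomous driving has advanced significantly, prevailing methods remain fundamentally misaligned with human cognitive principles in both perception and planning.
In this paper, we propose CogAD, a novel end-to-end autonomous driving model that emulates the hierarchical cognition mechanisms of human drivers.
CogAD implements dual hierarchical mechanisms: global-to-local context processing for human-like perception, and intent-conditioned multi-mode trajectory generation for cognitively inspired planning.
The proposed method demonstrates three principal advantages: comprehensive environmental understanding through hierarchical perception, robust planning exploration enabled by multi-level planning, and diverse yet reasonable multi-modal trajectory generation facilitated by dual-level uncertainty modeling.
Extensive experiments on nuScenes and Bench2Drive demonstrate that CogAD achieves state-of-the-art performance in end-to-end planning, exhibiting particular superiority in long-tail scenarios and robust generalization to complex real-world driving conditions.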

Summary (original)

While end-to-end autonomous driving has advanced significantly, prevailing methods remain fundamentally misaligned with human cognitive principles in both perception and planning. In this paper, we propose CogAD, a novel end-to-end autonomous driving model that emulates the hierarchical cognition mechanisms of human drivers. CogAD implements dual hierarchical mechanisms: global-to-local context processing for human-like perception and intent-conditioned multi-mode trajectory generation for cognitively-inspired planning. The proposed method demonstrates three principal advantages: comprehensive environmental understanding through hierarchical perception, robust planning exploration enabled by multi-level planning, and diverse yet reasonable multi-modal trajectory generation facilitated by dual-level uncertainty modeling. Extensive experiments on nuScenes and Bench2Drive demonstrate that CogAD achieves state-of-the-art performance in end-to-end planning, exhibiting particular superiority in long-tail scenarios and robust generalization to complex real-world driving conditions.

arXiv information

Authors	Zhennan Wang, Jianing Teng, Canqun Xiang, Kangliang Chen, Xing Pan, Lu Deng, Weihao Gu
Published	2025-06-01 02:19:21+00:00
arXiv site	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google

Categories: cs.CV, cs.RO

Stairway to Success: Zero-Shot Floor-Aware Object-Goal Navigation via LLM-Driven Coarse-to-Fine Exploration

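Posted: June 3, 2025 | Author: jarxiv

Summary

Object-Goal Navigation (OGN) remains challenging in real-world, multi-floor environments and under open-vocabulary object descriptions.
We observe that most episodes in widely used benchmarks such as HM3D and MP3D involve multi-floor buildings, with many requiring explicit floor transitions.
However, existing methods are often limited to single-floor settings or predefined object categories.
To address these limitations, we tackle two key challenges: (1) efficient cross-level planning and (2) zero-shot object-goal navigation (ZS-OGN), where agents must interpret novel object descriptions without prior exposure.
We propose ASCENT, a framework that combines a Multi-Floor Spatial Abstraction module for hierarchical semantic mapping with a Coarse-to-Fine Frontier Reasoning module that leverages large language models (LLMs) for context-aware exploration, without requiring additional training on new object semantics or locomotion data.
Our method outperforms state-of-the-art ZS-OGN approaches on the HM3D and MP3D benchmarks while enabling efficient multi-floor navigation.
We further validate its practicality through real-world deployment on a quadruped robot, achieving successful object exploration across unseen floors.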

Summary (original)

Object-Goal Navigation (OGN) remains challenging in real-world, multi-floor environments and under open-vocabulary object descriptions. We observe that most episodes in widely used benchmarks such as HM3D and MP3D involve multi-floor buildings, with many requiring explicit floor transitions. However, existing methods are often limited to single-floor settings or predefined object categories. To address these limitations, we tackle two key challenges: (1) efficient cross-level planning and (2) zero-shot object-goal navigation (ZS-OGN), where agents must interpret novel object descriptions without prior exposure. We propose ASCENT, a framework that combines a Multi-Floor Spatial Abstraction module for hierarchical semantic mapping and a Coarse-to-Fine Frontier Reasoning module leveraging Large Language Models (LLMs) for context-aware exploration, without requiring additional training on new object semantics or locomotion data. Our method outperforms state-of-the-art ZS-OGN approaches on HM3D and MP3D benchmarks while enabling efficient multi-floor navigation. We further validate its practicality through real-world deployment on a quadruped robot, achieving successful object exploration across unseen floors.

arXiv information

Authors	Zeying Gong, Rong Li, Tianshuai Hu, Ronghe Qiu, Lingdong Kong, Lingfeng Zhang, Yiyi Ding, Leying Zhang, Junwei Liang
Published	2025-06-01 02:48:34+00:00
arXiv site	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google

Categories: cs.RO

Learning to Drift in Extreme Turning with Active Exploration and Gaussian Process Based MPC

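Posted: June 3, 2025 | Author: jarxiv

Summary

Extreme cornering in racing often leads to large sideslip angles, presenting a significant challenge for vehicle control.
Conventional vehicle controllers struggle to manage this scenario, necessitating the use of a drifting controller.
However, the large sideslip angle in drift conditions introduces model mismatch, which in turn affects control precision.
To address this issue, we propose a model correction drift controller that integrates Model Predictive Control (MPC) with Gaussian Process Regression (GPR).
GPR is employed to correct vehicle model mismatches during both drift equilibrium solving and the MPC optimization process.
Additionally, the variance from GPR is used to actively explore different cornering drift velocities, aiming to minimize trajectory tracking errors.
The proposed algorithm is validated through simulations on the Simulink-Carsim platform and experiments with a 1:10 scale RC vehicle.
In the simulation, the average lateral error with GPR is reduced by 52.8% compared to the non-GPR case.
Incorporating exploration further decreases this error by 27.1%.
The velocity tracking root mean square error (RMSE) also decreases by 10.6% with exploration.
In the RC car experiment, the average lateral error with GPR is 36.7% lower, and exploration leads to a further 29.0% reduction.
Moreover, the velocity tracking RMSE decreases by 7.2% with the inclusion of exploration.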

Summary (original)

Extreme cornering in racing often leads to large sideslip angles, presenting a significant challenge for vehicle control. Conventional vehicle controllers struggle to manage this scenario, necessitating the use of a drifting controller. However, the large sideslip angle in drift conditions introduces model mismatch, which in turn affects control precision. To address this issue, we propose a model correction drift controller that integrates Model Predictive Control (MPC) with Gaussian Process Regression (GPR). GPR is employed to correct vehicle model mismatches during both drift equilibrium solving and the MPC optimization process. Additionally, the variance from GPR is utilized to actively explore different cornering drifting velocities, aiming to minimize trajectory tracking errors. The proposed algorithm is validated through simulations on the Simulink-Carsim platform and experiments with a 1:10 scale RC vehicle. In the simulation, the average lateral error with GPR is reduced by 52.8% compared to the non-GPR case. Incorporating exploration further decreases this error by 27.1%. The velocity tracking Root Mean Square Error (RMSE) also decreases by 10.6% with exploration. In the RC car experiment, the average lateral error with GPR is 36.7% lower, and exploration further leads to a 29.0% reduction. Moreover, the velocity tracking RMSE decreases by 7.2% with the inclusion of exploration.
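The abstract describes two uses of Gaussian process regression: its mean corrects the mismatch of a nominal vehicle model, and its variance guides exploration. The following is a minimal one-dimensional sketch of that idea in plain Python, not the paper's controller. The toy force curves, the RBF kernel hyperparameters, and all function names are hypothetical, chosen only so that the example runs on its own.

```python
import math

def rbf(a, b, length=0.5, sigma_f=1.0):
    """Squared-exponential kernel (hyperparameters are illustrative assumptions)."""
    return sigma_f ** 2 * math.exp(-0.5 * ((a - b) / length) ** 2)

def cholesky(A):
    """Lower-triangular Cholesky factor of a symmetric positive-definite matrix."""
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                L[i][j] = math.sqrt(A[i][i] - s)
            else:
                L[i][j] = (A[i][j] - s) / L[j][j]
    return L

def solve_chol(L, b):
    """Solve (L L^T) x = b by forward then backward substitution."""
    n = len(b)
    y = [0.0] * n
    for i in range(n):
        y[i] = (b[i] - sum(L[i][k] * y[k] for k in range(i))) / L[i][i]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (y[i] - sum(L[k][i] * x[k] for k in range(i + 1, n))) / L[i][i]
    return x

def gp_predict(xs, ys, x_star, noise=1e-2):
    """Posterior mean and variance of a zero-mean GP at x_star."""
    n = len(xs)
    K = [[rbf(xs[i], xs[j]) + (noise ** 2 if i == j else 0.0) for j in range(n)]
         for i in range(n)]
    L = cholesky(K)
    k_star = [rbf(x, x_star) for x in xs]
    alpha = solve_chol(L, ys)
    mean = sum(k * a for k, a in zip(k_star, alpha))
    v = solve_chol(L, k_star)
    var = rbf(x_star, x_star) - sum(k * w for k, w in zip(k_star, v))
    return mean, max(var, 0.0)

# Hypothetical nominal model and "true" behavior (toy curves, not vehicle data).
def nominal_force(slip):
    return 2.0 * slip

def true_force(slip):
    return 2.0 * slip - 0.8 * slip ** 3

# Train the GP on the residual between observed behavior and the nominal model.
train_x = [0.0, 0.2, 0.4, 0.6, 0.8]
train_r = [true_force(x) - nominal_force(x) for x in train_x]

# Corrected prediction inside the sampled range.
mean, var_near = gp_predict(train_x, train_r, 0.5)
corrected = nominal_force(0.5) + mean

# Predictive variance far from the data, where the model is least certain.
_, var_far = gp_predict(train_x, train_r, 2.0)
```

Here `corrected` adds the learned residual to the nominal prediction, and `var_far` exceeds `var_near` because no data has been collected near a slip value of 2.0. An exploration strategy can use that variance to decide where new samples would be most informative; how the paper turns the variance into drift-velocity exploration is not stated in the abstract.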

arXiv information

Authors	Guoqiang Wu, Cheng Hu, Wangjia Weng, Zhouheng Li, Yonghao Fu, Lei Xie, Hongye Su
Published	2025-06-01 04:26:04+00:00
arXiv site	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google

Categories: cs.AI, cs.RO, cs.SY, eess.SY

Cognitive Guardrails for Open-World Decision Making in Autonomous Drone Swarms

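Posted: June 3, 2025 | Author: jarxiv

Summary

Small Uncrewed Aerial Systems (sUAS) are increasingly deployed as autonomous swarms in search-and-rescue and other disaster-response scenarios.
In these settings, they use computer vision (CV) to detect objects of interest and autonomously adapt their missions.
However, traditional CV systems often struggle to recognize unfamiliar objects in open-world environments or to infer their relevance for mission planning.
To address this, we incorporate large language models (LLMs) to reason about detected objects and their implications.
While LLMs can offer valuable insights, they are also prone to hallucinations and may produce incorrect, misleading, or unsafe recommendations.
To ensure safe and sensible decision-making under uncertainty, high-level decisions must be governed by cognitive guardrails.
This article presents the design, simulation, and real-world integration of these guardrails for sUAS swarms in search-and-rescue missions.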

Summary (original)

Small Uncrewed Aerial Systems (sUAS) are increasingly deployed as autonomous swarms in search-and-rescue and other disaster-response scenarios. In these settings, they use computer vision (CV) to detect objects of interest and autonomously adapt their missions. However, traditional CV systems often struggle to recognize unfamiliar objects in open-world environments or to infer their relevance for mission planning. To address this, we incorporate large language models (LLMs) to reason about detected objects and their implications. While LLMs can offer valuable insights, they are also prone to hallucinations and may produce incorrect, misleading, or unsafe recommendations. To ensure safe and sensible decision-making under uncertainty, high-level decisions must be governed by cognitive guardrails. This article presents the design, simulation, and real-world integration of these guardrails for sUAS swarms in search-and-rescue missions.

arXiv information

Authors	Jane Cleland-Huang, Pedro Antonio Alarcon Granadeno, Arturo Miguel Russell Bernal, Demetrius Hernandez, Michael Murphy, Maureen Petterson, Walter Scheirer
Published	2025-06-01 06:27:23+00:00
arXiv site	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google

Categories: cs.AI, cs.HC, cs.RO

VB-Com: Learning Vision-Blind Composite Humanoid Locomotion Against Deficient Perception

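Posted: June 3, 2025 | Author: jarxiv

Summary

The performance of legged locomotion is closely tied to the accuracy and comprehensiveness of state observations.
Blind policies, which rely solely on proprioception, are considered highly robust due to the reliability of proprioceptive observations.
However, these policies significantly limit locomotion speed and often require collisions with the terrain to adapt.
In contrast, vision policies allow the robot to plan motions in advance and respond proactively to unstructured terrains with an online perception module.
However, perception is often compromised by noisy real-world environments, potential sensor failures, and the limitations of current simulations in presenting dynamic or deformable terrains.
Humanoid robots, with high degrees of freedom and inherently unstable morphology, are particularly susceptible to misguidance from deficient perception, which can result in falls or termination on challenging dynamic terrains.
To leverage the advantages of both vision and blind policies, we propose VB-Com, a composite framework that enables humanoid robots to determine when to rely on the vision policy and when to switch to the blind policy under perceptual deficiency.
We demonstrate that VB-Com effectively enables humanoid robots to traverse challenging terrains and obstacles despite perception deficiencies caused by dynamic terrains or perceptual noise.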

Summary (original)

The performance of legged locomotion is closely tied to the accuracy and comprehensiveness of state observations. Blind policies, which rely solely on proprioception, are considered highly robust due to the reliability of proprioceptive observations. However, these policies significantly limit locomotion speed and often require collisions with the terrain to adapt. In contrast, vision policies allow the robot to plan motions in advance and respond proactively to unstructured terrains with an online perception module. However, perception is often compromised by noisy real-world environments, potential sensor failures, and the limitations of current simulations in presenting dynamic or deformable terrains. Humanoid robots, with high degrees of freedom and inherently unstable morphology, are particularly susceptible to misguidance from deficient perception, which can result in falls or termination on challenging dynamic terrains. To leverage the advantages of both vision and blind policies, we propose VB-Com, a composite framework that enables humanoid robots to determine when to rely on the vision policy and when to switch to the blind policy under perceptual deficiency. We demonstrate that VB-Com effectively enables humanoid robots to traverse challenging terrains and obstacles despite perception deficiencies caused by dynamic terrains or perceptual noise.

arXiv information

Authors	Junli Ren, Tao Huang, Huayi Wang, Zirui Wang, Qingwei Ben, Junfeng Long, Yanchao Yang, Jiangmiao Pang, Ping Luo
Published	2025-06-01 10:13:42+00:00
arXiv site	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google

Categories: cs.RO

HAND Me the Data: Fast Robot Adaptation via Hand Path Retrieval

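Posted: June 3, 2025 | Author: jarxiv

Summary

We hand the community HAND, a simple and time-efficient method for teaching robots new manipulation tasks through human hand demonstrations.
Instead of relying on task-specific robot demonstrations collected via teleoperation, HAND uses easy-to-provide hand demonstrations to retrieve relevant behaviors from task-agnostic robot play data.
Using a visual tracking pipeline, HAND extracts the motion of the human hand from the hand demonstration and retrieves robot sub-trajectories in two stages: first filtering by visual similarity, then retrieving trajectories with behaviors similar to the hand's.
Fine-tuning a policy on the retrieved data enables real-time learning of tasks in under four minutes, without requiring calibrated cameras or detailed hand pose estimation.
Experiments also show that HAND outperforms retrieval baselines by over 2x in average task success rates on real robots.
Videos can be found at the project website: https://liralab.usc.edu/handretrieval/.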

Summary (original)

We hand the community HAND, a simple and time-efficient method for teaching robots new manipulation tasks through human hand demonstrations. Instead of relying on task-specific robot demonstrations collected via teleoperation, HAND uses easy-to-provide hand demonstrations to retrieve relevant behaviors from task-agnostic robot play data. Using a visual tracking pipeline, HAND extracts the motion of the human hand from the hand demonstration and retrieves robot sub-trajectories in two stages: first filtering by visual similarity, then retrieving trajectories with similar behaviors to the hand. Fine-tuning a policy on the retrieved data enables real-time learning of tasks in under four minutes, without requiring calibrated cameras or detailed hand pose estimation. Experiments also show that HAND outperforms retrieval baselines by over 2x in average task success rates on real robots. Videos can be found at our project website: https://liralab.usc.edu/handretrieval/.

arXiv information

Authors	Matthew Hong, Anthony Liang, Kevin Kim, Harshitha Rajaprakash, Jesse Thomason, Erdem Bıyık, Jesse Zhang
Published	2025-06-01 13:59:29+00:00
arXiv site	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google

Categories: cs.RO

RLZero: Direct Policy Inference from Language Without In-Domain Supervision

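Posted: June 3, 2025 | Author: jarxiv

Summary

The reward hypothesis states that all goals and purposes can be understood as the maximization of a received scalar reward signal.
However, in practice, defining such a reward signal is notoriously difficult, as humans are often unable to predict the optimal behavior corresponding to a reward function.
Natural language offers an intuitive alternative for instructing reinforcement learning (RL) agents, yet previous language-conditioned approaches require either costly supervision or test-time training given a language instruction.
In this work, we present a new approach that uses a pretrained RL agent, trained using only unlabeled offline interactions without task-specific supervision or labeled trajectories, to obtain zero-shot test-time policy inference from arbitrary natural language instructions.
We introduce a framework comprising three steps: imagine, project, and imitate.
First, the agent imagines a sequence of observations corresponding to the provided language description using video generative models.
Next, these imagined observations are projected into the target environment domain.
Finally, an agent pretrained in the target environment with unsupervised RL instantly imitates the projected observation sequence through a closed-form solution.
To the best of our knowledge, our method, RLZero, is the first approach to show direct language-to-behavior generation abilities on a variety of tasks and environments without any in-domain supervision.
We further show that components of RLZero can be used to generate policies zero-shot from cross-embodied videos, such as those available on YouTube, even for complex embodiments like humanoids.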

Summary (original)

The reward hypothesis states that all goals and purposes can be understood as the maximization of a received scalar reward signal. However, in practice, defining such a reward signal is notoriously difficult, as humans are often unable to predict the optimal behavior corresponding to a reward function. Natural language offers an intuitive alternative for instructing reinforcement learning (RL) agents, yet previous language-conditioned approaches either require costly supervision or test-time training given a language instruction. In this work, we present a new approach that uses a pretrained RL agent trained using only unlabeled, offline interactions–without task-specific supervision or labeled trajectories–to get zero-shot test-time policy inference from arbitrary natural language instructions. We introduce a framework comprising three steps: imagine, project, and imitate. First, the agent imagines a sequence of observations corresponding to the provided language description using video generative models. Next, these imagined observations are projected into the target environment domain. Finally, an agent pretrained in the target environment with unsupervised RL instantly imitates the projected observation sequence through a closed-form solution. To the best of our knowledge, our method, RLZero, is the first approach to show direct language-to-behavior generation abilities on a variety of tasks and environments without any in-domain supervision. We further show that components of RLZero can be used to generate policies zero-shot from cross-embodied videos, such as those available on YouTube, even for complex embodiments like humanoids.

arXiv information

Authors	Harshit Sikchi, Siddhant Agarwal, Pranaya Jajoo, Samyak Parajuli, Caleb Chuck, Max Rudolph, Peter Stone, Amy Zhang, Scott Niekum
Published	2025-06-01 15:15:39+00:00
arXiv site	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google

Categories: cs.AI, cs.GR, cs.LG, cs.RO
