Commonsense Reasoning for Legged Robot Adaptation with Vision-Language Models

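Abstract (translation)

Legged robots can traverse diverse environments and overcome a wide variety of obstacles. For example, in a search and rescue operation, a legged robot can climb over debris, crawl through gaps, and navigate out of dead ends. However, the robot's controller must respond intelligently to such varied obstacles, which requires handling unexpected and unusual scenarios successfully. This remains an open challenge for current learning methods, which often struggle to generalize to the long tail of unexpected situations without heavy human supervision. To address this problem, we study how to leverage the broad knowledge of the world's structure and the commonsense reasoning capabilities of vision-language models (VLMs) to help legged robots handle difficult, ambiguous situations. We propose VLM-Predictive Control (VLM-PC), a system combining two key components that we find crucial for eliciting on-the-fly, adaptive behavior selection with VLMs: (1) in-context adaptation over previous robot interactions and (2) planning multiple skills into the future and replanning. We evaluate VLM-PC on a Go1 quadruped robot across several challenging real-world obstacle courses involving dead ends, climbing, and crawling. Our experiments show that, by reasoning over the history of interactions and future plans, VLMs enable the robot to autonomously perceive, navigate, and act in a wide range of complex scenarios that would otherwise require environment-specific engineering or human guidance.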

Abstract (original)

Legged robots are physically capable of navigating a diverse variety of environments and overcoming a wide range of obstructions. For example, in a search and rescue mission, a legged robot could climb over debris, crawl through gaps, and navigate out of dead ends. However, the robot’s controller needs to respond intelligently to such varied obstacles, and this requires handling unexpected and unusual scenarios successfully. This presents an open challenge to current learning methods, which often struggle with generalization to the long tail of unexpected situations without heavy human supervision. To address this issue, we investigate how to leverage the broad knowledge about the structure of the world and commonsense reasoning capabilities of vision-language models (VLMs) to aid legged robots in handling difficult, ambiguous situations. We propose a system, VLM-Predictive Control (VLM-PC), combining two key components that we find to be crucial for eliciting on-the-fly, adaptive behavior selection with VLMs: (1) in-context adaptation over previous robot interactions and (2) planning multiple skills into the future and replanning. We evaluate VLM-PC on several challenging real-world obstacle courses, involving dead ends and climbing and crawling, on a Go1 quadruped robot. Our experiments show that by reasoning over the history of interactions and future plans, VLMs enable the robot to autonomously perceive, navigate, and act in a wide range of complex scenarios that would otherwise require environment-specific engineering or human guidance.

arXiv information

Authors	Annie S. Chen, Alec M. Lessing, Andy Tang, Govind Chada, Laura Smith, Sergey Levine, Chelsea Finn
Published	2024-07-02 21:00:30+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, DeepL

