QUAR-VLA: Vision-Language-Action Model for Quadruped Robots

Summary

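A key manifestation of robot intelligence is the ability to interact naturally and make decisions autonomously. Traditional approaches to robot control often compartmentalize perception, planning, and decision-making, which simplifies system design but limits the synergy between different information streams. This compartmentalization makes seamless autonomous reasoning, decision-making, and action execution difficult to achieve. To address these limitations, this paper introduces a new paradigm named QUAR-VLA (Vision-Language-Action tasks for QUAdruped Robots). The approach tightly integrates visual information and instructions to generate executable actions, merging perception, planning, and decision-making, with the central aim of raising the robot's overall intelligence. A notable challenge within this framework is aligning fine-grained instructions with visual perception information, that is, ensuring that the robot accurately interprets detailed instructions and acts on them in a way that is consistent with its visual observations. We therefore propose the QUAdruped Robotic Transformer (QUART), a family of VLA models that takes visual information and instructions from diverse modalities as input and generates executable actions for real-world robots. To train QUART models, we also present the QUAdruped Robot Dataset (QUARD), a large-scale multi-task dataset covering navigation, complex terrain locomotion, and whole-body manipulation tasks. Our extensive evaluation (4000 evaluation trials) shows that the approach yields performant robotic policies and enables QUART to acquire a range of emergent capabilities.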

Summary (original)

The important manifestation of robot intelligence is the ability to naturally interact and autonomously make decisions. Traditional approaches to robot control often compartmentalize perception, planning, and decision-making, simplifying system design but limiting the synergy between different information streams. This compartmentalization poses challenges in achieving seamless autonomous reasoning, decision-making, and action execution. To address these limitations, a novel paradigm, named Vision-Language-Action tasks for QUAdruped Robots (QUAR-VLA), has been introduced in this paper. This approach tightly integrates visual information and instructions to generate executable actions, effectively merging perception, planning, and decision-making. The central idea is to elevate the overall intelligence of the robot. Within this framework, a notable challenge lies in aligning fine-grained instructions with visual perception information. This emphasizes the complexity involved in ensuring that the robot accurately interprets and acts upon detailed instructions in harmony with its visual observations. Consequently, we propose QUAdruped Robotic Transformer (QUART), a family of VLA models to integrate visual information and instructions from diverse modalities as input and generates executable actions for real-world robots and present QUAdruped Robot Dataset (QUARD), a large-scale multi-task dataset including navigation, complex terrain locomotion, and whole-body manipulation tasks for training QUART models. Our extensive evaluation (4000 evaluation trials) shows that our approach leads to performant robotic policies and enables QUART to obtain a range of emergent capabilities.

arXiv information

Authors	Pengxiang Ding, Han Zhao, Wenjie Zhang, Wenxuan Song, Min Zhang, Siteng Huang, Ningxi Yang, Donglin Wang
Published	2025-02-04 13:33:56+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, DeepL
