RationalVLA: A Rational Vision-Language-Action Model with Dual System

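Summary

A fundamental requirement for deploying robots in the real world is the ability to understand and respond to natural-language instructions.
Existing language-conditioned manipulation tasks typically assume that instructions are perfectly aligned with the environment.
This assumption limits robustness and generalization in realistic scenarios, where instructions may be ambiguous, irrelevant, or infeasible.
To address this problem, the authors introduce RAtional MAnipulation (RAMA), a new benchmark that challenges models with both unseen executable instructions and defective instructions that should be rejected.
For RAMA, they construct a dataset of more than 14,000 samples that includes diverse defective instructions spanning six dimensions: visual, physical, semantic, motion, safety, and out-of-context.
They further propose the Rational Vision-Language-Action model (RationalVLA).
It is a dual system for robotic arms that integrates a high-level vision-language model with a low-level manipulation policy by introducing learnable latent space embeddings.
This design enables RationalVLA to reason over instructions, reject infeasible commands, and execute manipulation effectively.
Experiments show that RationalVLA outperforms state-of-the-art baselines on RAMA by a margin of 14.5% in success rate and 0.94 in average task length, while remaining competitive on standard manipulation tasks.
Real-world trials further validate its effectiveness and robustness in practical applications.
Project page: https://irpn-eai.github.io/rationalvla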

Abstract (original)

A fundamental requirement for real-world robotic deployment is the ability to understand and respond to natural language instructions. Existing language-conditioned manipulation tasks typically assume that instructions are perfectly aligned with the environment. This assumption limits robustness and generalization in realistic scenarios where instructions may be ambiguous, irrelevant, or infeasible. To address this problem, we introduce RAtional MAnipulation (RAMA), a new benchmark that challenges models with both unseen executable instructions and defective ones that should be rejected. In RAMA, we construct a dataset with over 14,000 samples, including diverse defective instructions spanning six dimensions: visual, physical, semantic, motion, safety, and out-of-context. We further propose the Rational Vision-Language-Action model (RationalVLA). It is a dual system for robotic arms that integrates the high-level vision-language model with the low-level manipulation policy by introducing learnable latent space embeddings. This design enables RationalVLA to reason over instructions, reject infeasible commands, and execute manipulation effectively. Experiments demonstrate that RationalVLA outperforms state-of-the-art baselines on RAMA by a 14.5% higher success rate and 0.94 average task length, while maintaining competitive performance on standard manipulation tasks. Real-world trials further validate its effectiveness and robustness in practical applications. Our project page is https://irpn-eai.github.io/rationalvla.

arXiv information

Authors	Wenxuan Song, Jiayi Chen, Wenxue Li, Xu He, Han Zhao, Pengxiang Ding, Shiyan Su, Feilong Tang, Xuelian Cheng, Donglin Wang, Zongyuan Ge, Xinhu Zheng, Zhe Liu, Hesheng Wang, Yunhui Liu, Haoang Li
Published	2025-06-12 15:44:51+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
