Grounded Vision-Language Interpreter for Integrated Task and Motion Planning

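Summary

Recent advances in vision-language models (VLMs) have accelerated the development of language-guided robot planners, but their black-box nature often means they lack the safety guarantees and interpretability that real-world deployment requires.
Conversely, classical symbolic planners provide rigorous safety verification but require substantial expert knowledge to set up.
To bridge this gap, the paper proposes ViLaIn-TAMP, a hybrid planning framework that enables verifiable, interpretable, and autonomous robot behavior.
ViLaIn-TAMP consists of three main components: (1) ViLaIn (Vision-Language Interpreter), a prior framework that converts multimodal inputs into structured problem specifications using off-the-shelf VLMs without additional domain-specific training; (2) a modular Task and Motion Planning (TAMP) system that grounds these specifications in actionable trajectory sequences through symbolic and geometric constraint reasoning and can use learning-based skills for key manipulation phases; and (3) a corrective planning module that receives concrete feedback on failed solution attempts from the task and motion planning components and feeds adapted logical and geometric feasibility constraints back to ViLaIn to refine the specification.
The framework is evaluated on several challenging manipulation tasks in a cooking domain.
The authors report that the proposed closed-loop corrective architecture gives ViLaIn-TAMP a mean success rate more than 30% higher than the same system without corrective planning.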

Summary (original)

While recent advances in vision-language models (VLMs) have accelerated the development of language-guided robot planners, their black-box nature often lacks safety guarantees and interpretability crucial for real-world deployment. Conversely, classical symbolic planners offer rigorous safety verification but require significant expert knowledge for setup. To bridge the current gap, this paper proposes ViLaIn-TAMP, a hybrid planning framework for enabling verifiable, interpretable, and autonomous robot behaviors. ViLaIn-TAMP comprises three main components: (1) ViLaIn (Vision-Language Interpreter) – A prior framework that converts multimodal inputs into structured problem specifications using off-the-shelf VLMs without additional domain-specific training, (2) a modular Task and Motion Planning (TAMP) system that grounds these specifications in actionable trajectory sequences through symbolic and geometric constraint reasoning and can utilize learning-based skills for key manipulation phases, and (3) a corrective planning module which receives concrete feedback on failed solution attempts from the motion and task planning components and can feed adapted logic and geometric feasibility constraints back to ViLaIn to improve and further refine the specification. We evaluate our framework on several challenging manipulation tasks in a cooking domain. We demonstrate that the proposed closed-loop corrective architecture exhibits a more than 30% higher mean success rate for ViLaIn-TAMP compared to without corrective planning.
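The closed loop described in the abstract (interpret, plan, and on failure feed constraints back to the interpreter) can be illustrated with a minimal Python sketch. This is not the authors' implementation: `PlanResult`, `corrective_planning_loop`, `toy_interpreter`, `toy_planner`, the `max_attempts` parameter, and the string-based specification format are all hypothetical names and simplifications introduced here only to show the control flow.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple


@dataclass
class PlanResult:
    """Outcome of one planning attempt (hypothetical structure)."""
    success: bool
    trajectory: List[str] = field(default_factory=list)
    feedback: Optional[str] = None  # reason for failure, if any


# Hypothetical signatures: an interpreter maps (instruction, feedback history)
# to a problem specification; a planner maps a specification to a PlanResult.
Interpreter = Callable[[str, List[str]], str]
Planner = Callable[[str], PlanResult]


def corrective_planning_loop(
    instruction: str,
    interpret: Interpreter,
    plan: Planner,
    max_attempts: int = 3,
) -> Tuple[PlanResult, int]:
    """Re-generate the specification until planning succeeds or attempts run out."""
    feedback_history: List[str] = []
    for attempt in range(1, max_attempts + 1):
        spec = interpret(instruction, feedback_history)
        result = plan(spec)
        if result.success:
            return result, attempt
        # Keep the failure reason so the next specification can account for it.
        feedback_history.append(result.feedback or "unknown failure")
    failed = PlanResult(success=False, feedback="; ".join(feedback_history))
    return failed, max_attempts


def toy_interpreter(instruction: str, feedback_history: List[str]) -> str:
    """Toy stand-in for a VLM: adds a constraint only after related feedback."""
    if any("not reachable" in f for f in feedback_history):
        return f"goal: {instruction}; constraint: move-closer-first"
    return f"goal: {instruction}"


def toy_planner(spec: str) -> PlanResult:
    """Toy stand-in for a TAMP system: fails unless the constraint is present."""
    if "move-closer-first" in spec:
        return PlanResult(True, trajectory=["move_closer", "pick", "slice"])
    return PlanResult(False, feedback="target not reachable")


if __name__ == "__main__":
    result, attempts = corrective_planning_loop(
        "slice the cucumber", toy_interpreter, toy_planner
    )
    print(result.success, attempts, result.trajectory)
```

In this toy run the first specification fails, the failure message is passed back, and the second specification succeeds. Setting `max_attempts=1` disables the correction step, which roughly mirrors the "without corrective planning" baseline mentioned in the abstract.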

arXiv information

Authors	Jeremy Siburian, Keisuke Shirai, Cristian C. Beltran-Hernandez, Masashi Hamaya, Michael Görner, Atsushi Hashimoto
Published	2025-06-03 18:00:32+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
