DexGraspVLA: A Vision-Language-Action Framework Towards General Dexterous Grasping

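Abstract

Dexterous grasping remains a fundamental yet challenging problem in robotics.
A general-purpose robot must be able to grasp diverse objects in arbitrary scenarios.
However, existing research typically relies on restrictive assumptions, such as single-object settings or limited environments, which constrains generalization.
We present DexGraspVLA, a hierarchical framework for general dexterous grasping in cluttered scenes, based on RGB image perception and language instructions.
It uses a pre-trained vision-language model as the high-level task planner and learns a diffusion-based policy as the low-level action controller.
The key insight for robust generalization is to iteratively transform diverse language and visual inputs into domain-invariant representations via foundation models; on those representations, imitation learning can be applied effectively because domain shift is alleviated.
Notably, the method achieves a 90+% success rate under thousands of unseen object, lighting, and background combinations in a "zero-shot" environment.
Empirical analysis confirms that internal model behavior is consistent across environmental variations, which validates the design and explains its generalization performance.
DexGraspVLA also demonstrates free-form long-horizon prompt execution, robustness to adversarial objects and human disturbance, and failure recovery.
Extended application to nonprehensile object grasping further demonstrates its generality.
Code, models, and videos are available at dexgraspvla.github.io.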

Abstract (original)

Dexterous grasping remains a fundamental yet challenging problem in robotics. A general-purpose robot must be capable of grasping diverse objects in arbitrary scenarios. However, existing research typically relies on restrictive assumptions, such as single-object settings or limited environments, leading to constrained generalization. We present DexGraspVLA, a hierarchical framework for general dexterous grasping in cluttered scenes based on RGB image perception and language instructions. It utilizes a pre-trained Vision-Language model as the high-level task planner and learns a diffusion-based policy as the low-level Action controller. The key insight to achieve robust generalization lies in iteratively transforming diverse language and visual inputs into domain-invariant representations via foundation models, where imitation learning can be effectively applied due to the alleviation of domain shift. Notably, our method achieves a 90+% success rate under thousands of unseen object, lighting, and background combinations in a ‘zero-shot’ environment. Empirical analysis confirms the consistency of internal model behavior across environmental variations, thereby validating our design and explaining its generalization performance. DexGraspVLA also demonstrates free-form long-horizon prompt execution, robustness to adversarial objects and human disturbance, and failure recovery, which are rarely achieved simultaneously in prior work. Extended application to nonprehensile object grasping further proves its generality. Code, model, and video are available at dexgraspvla.github.io.

arXiv information

Authors	Yifan Zhong, Xuchuan Huang, Ruochong Li, Ceyao Zhang, Yitao Liang, Yaodong Yang, Yuanpei Chen
Published	2025-05-22 08:27:36+00:00

Source and services used

arxiv.jp, Google
