DexGraspVLA: A Vision-Language-Action Framework Towards General Dexterous Grasping

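Abstract

Dexterous grasping remains a fundamental yet challenging problem in robotics.
A general-purpose robot must be able to grasp diverse objects in arbitrary scenarios.
However, existing work typically relies on specific assumptions, such as single-object settings or limited environments, which constrains generalization.
Our solution is DexGraspVLA, a hierarchical framework that uses a pre-trained vision-language model as the high-level task planner and learns a diffusion-based policy as the low-level action controller.
The key insight is to iteratively transform diverse language and visual inputs into domain-invariant representations, on which imitation learning can be applied effectively because domain shift is alleviated.
This enables robust generalization across a wide range of real-world scenarios.
Notably, our method achieves a success rate above 90% under thousands of unseen object, lighting, and background combinations in a "zero-shot" environment.
Empirical analysis further confirms that internal model behavior is consistent across environmental variations, which validates the design and explains its generalization performance.
We hope this work is a step toward general dexterous grasping.
The demo and code are available at https://dexgraspvla.github.io/.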

Abstract (original)

Dexterous grasping remains a fundamental yet challenging problem in robotics. A general-purpose robot must be capable of grasping diverse objects in arbitrary scenarios. However, existing research typically relies on specific assumptions, such as single-object settings or limited environments, leading to constrained generalization. Our solution is DexGraspVLA, a hierarchical framework that utilizes a pre-trained Vision-Language model as the high-level task planner and learns a diffusion-based policy as the low-level Action controller. The key insight lies in iteratively transforming diverse language and visual inputs into domain-invariant representations, where imitation learning can be effectively applied due to the alleviation of domain shift. Thus, it enables robust generalization across a wide range of real-world scenarios. Notably, our method achieves a 90+% success rate under thousands of unseen object, lighting, and background combinations in a “zero-shot” environment. Empirical analysis further confirms the consistency of internal model behavior across environmental variations, thereby validating our design and explaining its generalization performance. We hope our work can be a step forward in achieving general dexterous grasping. Our demo and code can be found at https://dexgraspvla.github.io/.

arXiv information

Authors	Yifan Zhong, Xuchuan Huang, Ruochong Li, Ceyao Zhang, Yitao Liang, Yaodong Yang, Yuanpei Chen
Published	2025-03-05 16:23:09+00:00

Source, services used

arxiv.jp, Google
