Efficient Alignment of Unconditioned Action Prior for Language-conditioned Pick and Place in Clutter

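Abstract

We study language-conditioned pick and place in clutter, where a robot must grasp a target object in open clutter and move it to a specified place.
Some approaches learn end-to-end policies using features from vision foundation models, which requires large datasets.
Others combine foundation models in a zero-shot setting and suffer from cascading errors.
In addition, they primarily leverage vision and language foundation models and focus less on action priors.
In this paper, we aim to develop an effective policy by integrating foundation priors from vision, language, and action.
We propose A$^2$, an action prior alignment method that aligns unconditioned action priors with 3D vision-language priors by learning one attention layer.
The alignment formulation lets our policy train with less data while preserving zero-shot generalization capabilities.
We show that a shared policy for both pick and place actions improves performance on each task, and we introduce a policy adaptation scheme to accommodate the multi-modal nature of actions.
Extensive experiments in simulation and the real world show that our policy achieves higher task success rates with fewer steps for both pick and place tasks in clutter, and generalizes effectively to unseen objects and language instructions.
Videos and code are available at https://xukechun.github.io/papers/A2.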

Abstract (original)

We study the task of language-conditioned pick and place in clutter, where a robot should grasp a target object in open clutter and move it to a specified place. Some approaches learn end-to-end policies with features from vision foundation models, requiring large datasets. Others combine foundation models in a zero-shot setting, suffering from cascading errors. In addition, they primarily leverage vision and language foundation models, focusing less on action priors. In this paper, we aim to develop an effective policy by integrating foundation priors from vision, language, and action. We propose A$^2$, an action prior alignment method that aligns unconditioned action priors with 3D vision-language priors by learning one attention layer. The alignment formulation enables our policy to train with less data and preserve zero-shot generalization capabilities. We show that a shared policy for both pick and place actions enhances the performance for each task, and introduce a policy adaptation scheme to accommodate the multi-modal nature of actions. Extensive experiments in simulation and the real world show that our policy achieves higher task success rates with fewer steps for both pick and place tasks in clutter, effectively generalizing to unseen objects and language instructions. Videos and code are available at https://xukechun.github.io/papers/A2.

arXiv information

Authors	Kechun Xu, Xunlong Xia, Kaixuan Wang, Yifei Yang, Yunxuan Mao, Bing Deng, Rong Xiong, Yue Wang
Published	2025-04-02 09:52:34+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
