Neural Attention Field: Emerging Point Relevance in 3D Scenes for One-Shot Dexterous Grasping

Summary

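Transferring dexterous grasps in one shot to novel scenes that vary in objects and context has been a challenging problem.
Feature fields distilled from large vision models have enabled semantic correspondences across 3D scenes, but their features are point-based and restricted to object surfaces, which limits their ability to model the complex semantic feature distributions involved in hand-object interactions.
In this work, we propose the \textit{neural attention field}, which represents semantic-aware dense feature fields in 3D space by modeling inter-point relevance rather than individual point features.
At its core is a transformer decoder that computes the cross-attention between any 3D query point and all scene points, and provides the query point feature through attention-based aggregation.
We further propose a self-supervised framework for training the transformer decoder from only a few 3D point clouds, without hand demonstrations.
After training, the attention field can be applied to novel scenes for semantics-aware dexterous grasping from a one-shot demonstration.
Experiments show that our method provides better optimization landscapes by encouraging the end-effector to focus on task-relevant scene regions, resulting in significantly higher success rates on real robots than feature-field-based methods.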
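
The attention-based aggregation mentioned in the summary can be illustrated with a minimal sketch of single-head scaled dot-product cross-attention for one query point. This is not the authors' architecture, which is a trained transformer decoder. The function name `query_point_feature`, the projection matrices `W_q` and `W_k`, the use of raw xyz coordinates as inputs, and the toy data are all assumptions made for illustration.

```python
import math


def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def query_point_feature(query_xyz, scene_xyz, scene_feats, W_q, W_k):
    """Aggregate scene features at an arbitrary 3D query point.

    Hypothetical sketch: the query and keys are linear projections of raw
    coordinates, and the values are the per-point scene features.
    """
    q = matvec(W_q, query_xyz)                  # query embedding
    keys = [matvec(W_k, p) for p in scene_xyz]  # one key per scene point
    scale = math.sqrt(len(q))
    scores = [sum(a * b for a, b in zip(q, k)) / scale for k in keys]
    weights = softmax(scores)                   # relevance of each scene point
    dim = len(scene_feats[0])
    feature = [sum(w * f[j] for w, f in zip(weights, scene_feats))
               for j in range(dim)]             # attention-weighted average
    return weights, feature


# Toy usage with identity projections (untrained, illustration only).
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
scene_xyz = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
scene_feats = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
weights, feature = query_point_feature(
    [2.0, 0.0, 0.0], scene_xyz, scene_feats, identity, identity)
```

Because the query point need not lie on an object surface, a field defined this way yields a feature anywhere in 3D space, and the attention weights indicate which scene points contribute most to it.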

Abstract (original)

One-shot transfer of dexterous grasps to novel scenes with object and context variations has been a challenging problem. While distilled feature fields from large vision models have enabled semantic correspondences across 3D scenes, their features are point-based and restricted to object surfaces, limiting their capability of modeling complex semantic feature distributions for hand-object interactions. In this work, we propose the \textit{neural attention field} for representing semantic-aware dense feature fields in the 3D space by modeling inter-point relevance instead of individual point features. Core to it is a transformer decoder that computes the cross-attention between any 3D query point with all the scene points, and provides the query point feature with an attention-based aggregation. We further propose a self-supervised framework for training the transformer decoder from only a few 3D pointclouds without hand demonstrations. Post-training, the attention field can be applied to novel scenes for semantics-aware dexterous grasping from one-shot demonstration. Experiments show that our method provides better optimization landscapes by encouraging the end-effector to focus on task-relevant scene regions, resulting in significant improvements in success rates on real robots compared with the feature-field-based methods.

arXiv information

Authors	Qianxu Wang, Congyue Deng, Tyler Ga Wei Lum, Yuanpei Chen, Yaodong Yang, Jeannette Bohg, Yixin Zhu, Leonidas Guibas
Published	2024-10-30 14:06:51+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
