Perceiver-Actor: A Multi-Task Transformer for Robotic Manipulation

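Abstract

Transformers have revolutionized vision and natural language processing through their ability to scale with large datasets.
In robotic manipulation, however, data is limited and expensive.
With the right problem formulation, can we still benefit from Transformers?
We investigate this question with PerAct, a language-conditioned behavior-cloning agent for multi-task 6-DoF manipulation.
PerAct encodes language goals and RGB-D voxel observations with a Perceiver Transformer and outputs discretized actions by "detecting the next best voxel action".
Unlike frameworks that operate on 2D images, the voxelized observation and action space provides a strong structural prior for learning 6-DoF policies efficiently.
With this formulation, we train a single multi-task Transformer on 18 RLBench tasks (249 variations) and 7 real-world tasks (18 variations) from just a few demonstrations per task.
Our results show that PerAct significantly outperforms unstructured image-to-action agents and 3D ConvNet baselines across a wide range of tabletop tasks.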

Abstract (original)

Transformers have revolutionized vision and natural language processing with their ability to scale with large datasets. But in robotic manipulation, data is both limited and expensive. Can we still benefit from Transformers with the right problem formulation? We investigate this question with PerAct, a language-conditioned behavior-cloning agent for multi-task 6-DoF manipulation. PerAct encodes language goals and RGB-D voxel observations with a Perceiver Transformer, and outputs discretized actions by ‘detecting the next best voxel action’. Unlike frameworks that operate on 2D images, the voxelized observation and action space provides a strong structural prior for efficiently learning 6-DoF policies. With this formulation, we train a single multi-task Transformer for 18 RLBench tasks (with 249 variations) and 7 real-world tasks (with 18 variations) from just a few demonstrations per task. Our results show that PerAct significantly outperforms unstructured image-to-action agents and 3D ConvNet baselines for a wide range of tabletop tasks.
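
To make the idea of a voxelized observation and a discretized "next best voxel" action concrete, here is a minimal sketch in Python with NumPy. It is not PerAct's implementation: the function names, the cubic workspace bounds, and the grid size are assumptions chosen for illustration, and the score volume stands in for whatever a trained model would output. It covers only the translation part of an action; rotation and gripper state are left out.

```python
import numpy as np


def voxelize(points, bounds_min, bounds_max, grid_size):
    """Map 3D points of shape (N, 3) to integer voxel indices.

    Assumes the points lie inside the workspace bounds; points on or
    beyond the boundary are clipped into the outermost voxels.
    """
    points = np.asarray(points, dtype=float)
    bounds_min = np.asarray(bounds_min, dtype=float)
    bounds_max = np.asarray(bounds_max, dtype=float)
    voxel_size = (bounds_max - bounds_min) / grid_size
    idx = np.floor((points - bounds_min) / voxel_size).astype(int)
    return np.clip(idx, 0, grid_size - 1)


def occupancy_grid(points, bounds_min, bounds_max, grid_size):
    """Build a boolean occupancy volume of shape (G, G, G)."""
    grid = np.zeros((grid_size, grid_size, grid_size), dtype=bool)
    idx = voxelize(points, bounds_min, bounds_max, grid_size)
    grid[idx[:, 0], idx[:, 1], idx[:, 2]] = True
    return grid


def best_voxel_action(scores, bounds_min, bounds_max):
    """Pick the highest-scoring voxel and return (index, voxel center).

    `scores` is a hypothetical (G, G, G) volume of per-voxel scores.
    """
    bounds_min = np.asarray(bounds_min, dtype=float)
    bounds_max = np.asarray(bounds_max, dtype=float)
    grid_size = scores.shape[0]
    index = np.unravel_index(int(np.argmax(scores)), scores.shape)
    voxel_size = (bounds_max - bounds_min) / grid_size
    center = bounds_min + (np.array(index) + 0.5) * voxel_size
    return tuple(int(i) for i in index), center


if __name__ == "__main__":
    lo, hi = [0.0, 0.0, 0.0], [1.0, 1.0, 1.0]
    cloud = [[0.0, 0.0, 0.0], [0.5, 0.25, 0.75]]
    print(voxelize(cloud, lo, hi, grid_size=4))
```

Treating the translation target as a choice among voxels turns action prediction into a classification over grid cells, which is the sense in which the observation and action spaces share one 3D structure.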

arXiv information

Authors	Mohit Shridhar, Lucas Manuelli, Dieter Fox
Published	2022-09-12 17:51:05+00:00
arXiv site	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google
