ManipGPT: Is Affordance Segmentation by Large Vision Models Enough for Articulated Object Manipulation?

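Summary

Visual actionable affordance has emerged as a transformative approach in robotics, focusing on perceiving interaction regions before manipulation.
Traditional methods rely on pixel sampling to identify successful interaction samples, or on processing point clouds for affordance mapping.
However, these approaches are computationally intensive and struggle to adapt to diverse and dynamic environments.
This paper introduces ManipGPT, a framework designed to predict optimal interaction regions for articulated objects using a large pre-trained vision transformer (ViT).
The authors created a dataset of 9.9k simulated and real images to bridge the sim-to-real gap and improve real-world applicability.
Fine-tuning the vision transformer on this small dataset significantly improved part-level affordance segmentation and adapted the model's in-context segmentation capabilities to robot manipulation scenarios.
Generating part-level affordance masks and pairing them with an impedance adaptation policy enables effective manipulation in both simulated and real-world environments, which sufficiently eliminates the need for complex datasets or perception systems.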

Abstract (original)

Visual actionable affordance has emerged as a transformative approach in robotics, focusing on perceiving interaction areas prior to manipulation. Traditional methods rely on pixel sampling to identify successful interaction samples or processing pointclouds for affordance mapping. However, these approaches are computationally intensive and struggle to adapt to diverse and dynamic environments. This paper introduces ManipGPT, a framework designed to predict optimal interaction areas for articulated objects using a large pre-trained vision transformer (ViT). We created a dataset of 9.9k simulated and real images to bridge the sim-to-real gap and enhance real-world applicability. By fine-tuning the vision transformer on this small dataset, we significantly improved part-level affordance segmentation, adapting the model’s in-context segmentation capabilities to robot manipulation scenarios. This enables effective manipulation across simulated and real-world environments by generating part-level affordance masks, paired with an impedance adaptation policy, sufficiently eliminating the need for complex datasets or perception systems.

arXiv information

Authors	Taewhan Kim, Hojin Bae, Zeming Li, Xiaoqi Li, Iaroslav Ponomarenko, Ruihai Wu, Hao Dong
Published	2024-12-18 07:08:26+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
