Language-guided Robot Grasping: CLIP-based Referring Grasp Synthesis in Clutter

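Summary

Robots operating in human-centric environments need to integrate visual grounding and grasping capabilities in order to manipulate objects effectively based on user instructions.
This work focuses on the task of referring grasp synthesis: predicting a grasp pose for an object that is referred to through natural language in a cluttered scene.
Existing approaches often employ multi-stage pipelines that first segment the referred object and then propose a suitable grasp, and they are evaluated on private datasets or in simulators that do not capture the complexity of natural indoor scenes.
To address these limitations, we develop a challenging benchmark based on cluttered indoor scenes from the OCID dataset, for which we generate referring expressions and connect them with 4-DoF grasp poses.
Further, we propose a novel end-to-end model (CROG) that leverages the visual grounding capabilities of CLIP to learn grasp synthesis directly from image-text pairs.
Our results show that a vanilla integration of CLIP with pretrained models transfers poorly to our challenging benchmark, whereas CROG achieves significant improvements in both grounding and grasping.
Extensive robot experiments in both simulation and hardware demonstrate the effectiveness of our approach in challenging interactive object grasping scenarios that include clutter.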

Summary (original)

Robots operating in human-centric environments require the integration of visual grounding and grasping capabilities to effectively manipulate objects based on user instructions. This work focuses on the task of referring grasp synthesis, which predicts a grasp pose for an object referred through natural language in cluttered scenes. Existing approaches often employ multi-stage pipelines that first segment the referred object and then propose a suitable grasp, and are evaluated in private datasets or simulators that do not capture the complexity of natural indoor scenes. To address these limitations, we develop a challenging benchmark based on cluttered indoor scenes from OCID dataset, for which we generate referring expressions and connect them with 4-DoF grasp poses. Further, we propose a novel end-to-end model (CROG) that leverages the visual grounding capabilities of CLIP to learn grasp synthesis directly from image-text pairs. Our results show that vanilla integration of CLIP with pretrained models transfers poorly in our challenging benchmark, while CROG achieves significant improvements both in terms of grounding and grasping. Extensive robot experiments in both simulation and hardware demonstrate the effectiveness of our approach in challenging interactive object grasping scenarios that include clutter.

arXiv information

Authors	Georgios Tziafas, Yucheng Xu, Arushi Goel, Mohammadreza Kasaei, Zhibin Li, Hamidreza Kasaei
Published	2023-11-09 22:55:10+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
