INVIGORATE: Interactive Visual Grounding and Grasping in Clutter

Abstract

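This paper presents INVIGORATE, a robot system that interacts with humans through natural language and grasps a specified object in clutter.
The objects may occlude, obstruct, or even be stacked on top of one another.
INVIGORATE addresses several challenges: (i) inferring the target object among other occluding objects from the input language expression and RGB images, (ii) inferring object blocking relationships (OBRs) from the images, and (iii) synthesizing a multi-step plan that asks questions to disambiguate the target object and then grasps it successfully.
We train separate neural networks for object detection, visual grounding, question generation, and OBR detection and grasping.
Subject to the training datasets, these networks allow for unrestricted object categories and language expressions.
However, errors in visual perception and ambiguity in human language are inevitable and degrade the robot's performance.
To handle these uncertainties, we build a partially observable Markov decision process (POMDP) that integrates the learned neural network modules.
Through approximate POMDP planning, the robot tracks the history of observations and asks disambiguating questions to achieve a near-optimal sequence of actions that identifies and grasps the target object.
INVIGORATE combines the benefits of model-based POMDP planning and data-driven deep learning.
Preliminary experiments with INVIGORATE on a Fetch robot show significant benefits of this integrated approach to grasping objects in clutter with natural language interaction.
A demonstration video is available at https://youtu.be/zYakh80SGcU.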
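
The idea of tracking a belief over candidate targets and choosing between asking and grasping can be illustrated with a toy sketch. This is not the INVIGORATE planner: the function names, the yes/no question format, the 0.9 answer accuracy, and the 0.8 grasp threshold are all assumptions made for illustration. A greedy threshold rule stands in for approximate POMDP planning, and object blocking relationships are ignored.

```python
from typing import List, Tuple


def update_belief(belief: List[float], asked: int, answer_yes: bool,
                  answer_accuracy: float = 0.9) -> List[float]:
    """Bayes update of P(target = i) after a yes/no answer about object `asked`.

    `answer_accuracy` is an assumed probability that the human's answer is
    interpreted correctly; it must lie strictly between 0 and 1.
    """
    posterior = []
    for i, prior in enumerate(belief):
        # If object i were the target, a truthful answer would be "yes"
        # only when i is the object that was asked about.
        truthful_yes = (i == asked)
        if truthful_yes == answer_yes:
            likelihood = answer_accuracy
        else:
            likelihood = 1.0 - answer_accuracy
        posterior.append(prior * likelihood)
    total = sum(posterior)
    return [p / total for p in posterior]


def choose_action(belief: List[float],
                  grasp_threshold: float = 0.8) -> Tuple[str, int]:
    """Greedy rule (an assumption, not a POMDP solver): grasp the most likely
    object once the belief is confident enough, otherwise ask about it."""
    best = max(range(len(belief)), key=lambda i: belief[i])
    if belief[best] >= grasp_threshold:
        return ("grasp", best)
    return ("ask", best)
```

Starting from a belief of `[0.5, 0.3, 0.2]` over three candidate objects, `choose_action` asks about object 0. A "yes" answer moves the belief to roughly `[0.9, 0.06, 0.04]`, at which point the rule switches to grasping object 0. A "no" answer shifts the mass toward the other two objects and triggers another question.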

Abstract (original)

This paper presents INVIGORATE, a robot system that interacts with human through natural language and grasps a specified object in clutter. The objects may occlude, obstruct, or even stack on top of one another. INVIGORATE embodies several challenges: (i) infer the target object among other occluding objects, from input language expressions and RGB images, (ii) infer object blocking relationships (OBRs) from the images, and (iii) synthesize a multi-step plan to ask questions that disambiguate the target object and to grasp it successfully. We train separate neural networks for object detection, for visual grounding, for question generation, and for OBR detection and grasping. They allow for unrestricted object categories and language expressions, subject to the training datasets. However, errors in visual perception and ambiguity in human languages are inevitable and negatively impact the robot’s performance. To overcome these uncertainties, we build a partially observable Markov decision process (POMDP) that integrates the learned neural network modules. Through approximate POMDP planning, the robot tracks the history of observations and asks disambiguation questions in order to achieve a near-optimal sequence of actions that identify and grasp the target object. INVIGORATE combines the benefits of model-based POMDP planning and data-driven deep learning. Preliminary experiments with INVIGORATE on a Fetch robot show significant benefits of this integrated approach to object grasping in clutter with natural language interactions. A demonstration video is available at https://youtu.be/zYakh80SGcU.

arXiv information

Authors	Hanbo Zhang, Yunfan Lu, Cunjun Yu, David Hsu, Xuguang Lan, Nanning Zheng
Published	2024-01-08 02:22:44+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
