GRID: Scene-Graph-based Instruction-driven Robotic Task Planning

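Summary

Recent work shows that large language models (LLMs) can help ground instructions for robotic task planning.
Most existing work, however, relies on raw images to help the LLM understand the environment.
This approach limits the scope of observation and typically requires extensive multimodal data collection and large-scale models.
This paper proposes the Graph-based Robotic Instruction Decomposer (GRID), which uses scene graphs instead of images to perceive global scene information and iteratively plans subtasks for a given instruction.
The method encodes object attributes and relationships in the graphs with an LLM and graph attention networks, then integrates instruction features to predict each subtask, which consists of a pre-defined robot action and a target object in the scene graph.
This strategy lets the robot acquire semantic knowledge observed widely across the environment from the scene graph.
To train and evaluate GRID, we build a dataset construction pipeline that generates synthetic datasets for graph-based robotic task planning.
In experiments, the method outperforms GPT-4 by more than 25.4% in subtask accuracy and 43.6% in task accuracy.
It also runs in real time, at 0.11 s per inference.
On datasets of unseen scenes and of scenes with varying numbers of objects, GRID's task accuracy drops by at most 3.8%, indicating robust cross-scene generalization.
We validate the method in both physical simulation and the real world.
More details are available on the project page: https://jackyzengl.github.io/GRID.github.io/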
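
The iterative planning described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the `SceneGraph` and `Subtask` types, the `plan` loop, and the `toy_predict` and `toy_update` stand-ins are all hypothetical names introduced here. In particular, the learned LLM and graph-attention predictor is replaced by a fixed script, and the assumption that the scene graph is refreshed after each subtask is ours.

```python
from dataclasses import dataclass, field


@dataclass
class SceneGraph:
    """Objects as nodes with attributes, relations as labeled edges."""
    nodes: dict = field(default_factory=dict)  # node id -> attribute dict
    edges: set = field(default_factory=set)    # (source id, relation, target id)


@dataclass(frozen=True)
class Subtask:
    action: str  # drawn from a pre-defined action set
    target: str  # id of a node in the scene graph


def plan(instruction, graph, predict, update, max_steps=10):
    """Ask the predictor for one subtask at a time until it signals completion."""
    history = []
    for _ in range(max_steps):
        subtask = predict(instruction, graph, history)
        if subtask is None:  # hypothetical "task finished" signal
            break
        if subtask.target not in graph.nodes:
            raise ValueError(f"unknown target object: {subtask.target}")
        history.append(subtask)
        graph = update(graph, subtask)  # assumed: the graph is refreshed each step
    return history, graph


def toy_predict(instruction, graph, history):
    """Stand-in for the learned model: replays a fixed two-step script."""
    script = [Subtask("pick", "cup"), Subtask("place", "sink")]
    return script[len(history)] if len(history) < len(script) else None


def toy_update(graph, subtask):
    """Stand-in for perception: edits the edges to reflect the executed subtask."""
    edges = set(graph.edges)
    if subtask.action == "pick":
        edges = {e for e in edges if e[0] != subtask.target}
        edges.add(("robot", "holding", subtask.target))
    elif subtask.action == "place":
        held = [e[2] for e in edges if e[:2] == ("robot", "holding")]
        edges = {e for e in edges if e[:2] != ("robot", "holding")}
        edges.update((obj, "in", subtask.target) for obj in held)
    return SceneGraph(nodes=dict(graph.nodes), edges=edges)


scene = SceneGraph(
    nodes={"robot": {}, "cup": {"color": "red"}, "table": {}, "sink": {}},
    edges={("cup", "on", "table")},
)
subtasks, final_scene = plan("put the cup in the sink", scene, toy_predict, toy_update)
print(subtasks)
```

Each subtask pairs an action with a node id, so a prediction can be checked against the scene graph before it is executed. This is one practical benefit of planning over a graph rather than over raw pixels.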

Abstract (original)

Recent works have shown that Large Language Models (LLMs) can facilitate the grounding of instructions for robotic task planning. Despite this progress, most existing works have primarily focused on utilizing raw images to aid LLMs in understanding environmental information. However, this approach not only limits the scope of observation but also typically necessitates extensive multimodal data collection and large-scale models. In this paper, we propose a novel approach called Graph-based Robotic Instruction Decomposer (GRID), which leverages scene graphs instead of images to perceive global scene information and iteratively plan subtasks for a given instruction. Our method encodes object attributes and relationships in graphs through an LLM and Graph Attention Networks, integrating instruction features to predict subtasks consisting of pre-defined robot actions and target objects in the scene graph. This strategy enables robots to acquire semantic knowledge widely observed in the environment from the scene graph. To train and evaluate GRID, we establish a dataset construction pipeline to generate synthetic datasets for graph-based robotic task planning. Experiments have shown that our method outperforms GPT-4 by over 25.4% in subtask accuracy and 43.6% in task accuracy. Moreover, our method achieves a real-time speed of 0.11s per inference. Experiments conducted on datasets of unseen scenes and scenes with varying numbers of objects demonstrate that the task accuracy of GRID declined by at most 3.8%, showcasing its robust cross-scene generalization ability. We validate our method in both physical simulation and the real world. More details can be found on the project page https://jackyzengl.github.io/GRID.github.io/.
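
The abstract reports subtask accuracy and task accuracy separately but does not define them. The sketch below assumes a common reading: subtask accuracy scores each predicted step on its own, while task accuracy counts a task as correct only when its whole subtask sequence matches the reference. The function names and toy data are hypothetical. The example shows how one wrong step lowers task accuracy more than subtask accuracy.

```python
def subtask_accuracy(predicted, reference):
    """Fraction of reference subtasks matched at the same position."""
    total = sum(len(ref) for ref in reference)
    correct = sum(
        p == r
        for pred, ref in zip(predicted, reference)
        for p, r in zip(pred, ref)  # simplification: surplus predictions are ignored
    )
    return correct / total


def task_accuracy(predicted, reference):
    """Fraction of tasks whose full subtask sequence matches the reference."""
    matches = sum(pred == ref for pred, ref in zip(predicted, reference))
    return matches / len(reference)


# Two toy tasks of three subtasks each, written as (action, target) pairs.
reference = [
    [("move", "table"), ("pick", "cup"), ("place", "sink")],
    [("move", "shelf"), ("pick", "book"), ("place", "desk")],
]
predicted = [
    [("move", "table"), ("pick", "cup"), ("place", "sink")],
    [("move", "shelf"), ("pick", "book"), ("place", "table")],  # last step wrong
]

print(subtask_accuracy(predicted, reference))  # 5 of 6 subtasks match
print(task_accuracy(predicted, reference))     # 1 of 2 tasks fully match
```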

arXiv information

Authors	Zhe Ni, Xiaoxin Deng, Cong Tai, Xinyue Zhu, Qinghongbing Xie, Weihang Huang, Xiang Wu, Long Zeng
Published	2024-03-11 02:20:07+00:00

Source, services used

arxiv.jp, Google
