Manual2Skill: Learning to Read Manuals and Acquire Robotic Skills for Furniture Assembly Using Vision-Language Models

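Summary

Humans have a remarkable ability to understand and carry out complex manipulation tasks by interpreting abstract instruction manuals.
For robots, however, this capability remains a major challenge, because they cannot interpret abstract instructions and convert them into executable actions.
In this paper, we introduce Manual2Skill, a new framework that enables robots to perform complex assembly tasks guided by high-level manual instructions.
Our approach uses a vision-language model (VLM) to extract structured information from instructional images and uses that information to build hierarchical assembly graphs.
These graphs represent parts, subassemblies, and the relationships among them.
To support task execution, a pose estimation model predicts the relative 6D poses of the components at each assembly step.
In parallel, a motion planning module generates actionable sequences for execution on a real robot.
We demonstrate the effectiveness of Manual2Skill by successfully assembling several real-world IKEA furniture items.
This application highlights the framework's ability to handle long-horizon manipulation tasks with both efficiency and precision, and it significantly improves the practicality of robot learning from instruction manuals.
This work is a step toward robotic systems that can understand and execute complex manipulation tasks in a way that resembles human capability.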
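
The abstract gives no implementation details, so the following is only a minimal sketch of what a hierarchical assembly graph could look like, assuming a tree in which leaves are parts and internal nodes are subassemblies. The class `AssemblyNode`, the function `assembly_steps`, and the stool example are all hypothetical and are not taken from Manual2Skill. A post-order traversal yields one possible step order in which every subassembly is built before it is used.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class AssemblyNode:
    """One node of a hypothetical hierarchical assembly graph."""
    name: str
    children: List["AssemblyNode"] = field(default_factory=list)

    @property
    def is_part(self) -> bool:
        # Leaves are atomic parts; internal nodes are subassemblies.
        return not self.children


def assembly_steps(root: AssemblyNode) -> List[Tuple[str, List[str]]]:
    """Return (result, inputs) pairs via post-order traversal,
    so every input of a step is available before that step runs."""
    steps: List[Tuple[str, List[str]]] = []

    def visit(node: AssemblyNode) -> None:
        if node.is_part:
            return
        for child in node.children:
            visit(child)
        steps.append((node.name, [c.name for c in node.children]))

    visit(root)
    return steps


# Hypothetical example: a stool made of a base (two legs and a crossbar) and a seat.
base = AssemblyNode(
    "base",
    [AssemblyNode("leg_1"), AssemblyNode("leg_2"), AssemblyNode("crossbar")],
)
stool = AssemblyNode("stool", [base, AssemblyNode("seat")])

for result, inputs in assembly_steps(stool):
    print(f"{' + '.join(inputs)} -> {result}")
```

In a pipeline like the one the abstract describes, each such step would presumably be handed to pose estimation and motion planning; that hand-off is not modeled here.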

Abstract (original)

Humans possess an extraordinary ability to understand and execute complex manipulation tasks by interpreting abstract instruction manuals. For robots, however, this capability remains a substantial challenge, as they cannot interpret abstract instructions and translate them into executable actions. In this paper, we present Manual2Skill, a novel framework that enables robots to perform complex assembly tasks guided by high-level manual instructions. Our approach leverages a Vision-Language Model (VLM) to extract structured information from instructional images and then uses this information to construct hierarchical assembly graphs. These graphs represent parts, subassemblies, and the relationships between them. To facilitate task execution, a pose estimation model predicts the relative 6D poses of components at each assembly step. At the same time, a motion planning module generates actionable sequences for real-world robotic implementation. We demonstrate the effectiveness of Manual2Skill by successfully assembling several real-world IKEA furniture items. This application highlights its ability to manage long-horizon manipulation tasks with both efficiency and precision, significantly enhancing the practicality of robot learning from instruction manuals. This work marks a step forward in advancing robotic systems capable of understanding and executing complex manipulation tasks in a manner akin to human capabilities.

arXiv information

Authors	Chenrui Tie, Shengxiang Sun, Jinxuan Zhu, Yiwei Liu, Jingxiang Guo, Yue Hu, Haonan Chen, Junting Chen, Ruihai Wu, Lin Shao
Published	2025-02-14 11:25:24+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
