CubeRobot: Grounding Language in Rubik’s Cube Manipulation via Vision-Language Model

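Abstract

Proving Rubik’s Cube theorems at a high level represents a notable milestone in human-level spatial imagination, logical thinking, and reasoning.
Traditional Rubik’s Cube robots, which rely on complex vision systems and fixed algorithms, often struggle to adapt to complex and dynamic scenarios.
To overcome this limitation, we introduce CubeRobot, a novel vision-language model (VLM) tailored to solving 3×3 Rubik’s Cubes, which gives embodied agents multimodal understanding and execution capabilities.
We use the CubeCoT image dataset, which contains multi-level tasks (43 subtasks in total) that humans are unable to handle and covers a variety of cube states.
We incorporate a dual-loop VisionCoT architecture and Memory Stream, a paradigm for extracting task-related features from VLM-generated planning queries, enabling CubeRobot to plan, make decisions, and reflect independently, and to manage high- and low-level Rubik’s Cube tasks separately.
In low-level Rubik’s Cube restoration tasks, CubeRobot achieved 100% accuracy; it likewise achieved 100% on medium-level tasks and 80% on high-level tasks.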

Abstract (original)

Proving Rubik’s Cube theorems at the high level represents a notable milestone in human-level spatial imagination and logic thinking and reasoning. Traditional Rubik’s Cube robots, relying on complex vision systems and fixed algorithms, often struggle to adapt to complex and dynamic scenarios. To overcome this limitation, we introduce CubeRobot, a novel vision-language model (VLM) tailored for solving 3×3 Rubik’s Cubes, empowering embodied agents with multimodal understanding and execution capabilities. We used the CubeCoT image dataset, which contains multiple-level tasks (43 subtasks in total) that humans are unable to handle, encompassing various cube states. We incorporate a dual-loop VisionCoT architecture and Memory Stream, a paradigm for extracting task-related features from VLM-generated planning queries, thus enabling CubeRobot to independent planning, decision-making, reflection and separate management of high- and low-level Rubik’s Cube tasks. Furthermore, in low-level Rubik’s Cube restoration tasks, CubeRobot achieved a high accuracy rate of 100%, similar to 100% in medium-level tasks, and achieved an accuracy rate of 80% in high-level tasks.

arXiv information

Authors	Feiyang Wang, Xiaomin Yu, Wangyu Wu
Published	2025-03-25 02:23:47+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
