PhysVLM: Enabling Visual Language Models to Understand Robotic Physical Reachability

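Summary

Understanding the environment and a robot's physical reachability is essential for task execution.
State-of-the-art vision-language models (VLMs) excel at environmental perception, but because they lack an understanding of a robot's physical reachability, they often produce inaccurate or impractical responses in embodied visual reasoning tasks.
To address this problem, we propose a unified representation of physical reachability across diverse robots, the Space-Physical Reachability Map (S-P Map), and PhysVLM, a vision-language model that integrates this reachability information into visual reasoning.
Specifically, the S-P Map abstracts a robot's physical reachability into a generalized spatial representation that is independent of any specific robot configuration, so the model can focus on reachability features rather than robot-specific parameters.
PhysVLM then extends the conventional VLM architecture with an additional feature encoder that processes the S-P Map, allowing the model to reason about physical reachability without compromising its general vision-language capabilities.
To train and evaluate PhysVLM, we built a large-scale multi-robot dataset, Phys100K, and a challenging benchmark, EQA-phys, which includes tasks for six different robots in both simulated and real-world environments.
Experimental results show that PhysVLM outperforms existing models, achieving a 14% improvement over GPT-4o on EQA-phys and surpassing advanced embodied VLMs such as RoboMamba and SpatialVLM on the RoboVQA-val and OpenEQA benchmarks.
In addition, the S-P Map shows strong compatibility with various VLMs: integrating it into GPT-4o-mini yields a 7.1% performance improvement.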

Summary (original)

Understanding the environment and a robot’s physical reachability is crucial for task execution. While state-of-the-art vision-language models (VLMs) excel in environmental perception, they often generate inaccurate or impractical responses in embodied visual reasoning tasks due to a lack of understanding of robotic physical reachability. To address this issue, we propose a unified representation of physical reachability across diverse robots, i.e., Space-Physical Reachability Map (S-P Map), and PhysVLM, a vision-language model that integrates this reachability information into visual reasoning. Specifically, the S-P Map abstracts a robot’s physical reachability into a generalized spatial representation, independent of specific robot configurations, allowing the model to focus on reachability features rather than robot-specific parameters. Subsequently, PhysVLM extends traditional VLM architectures by incorporating an additional feature encoder to process the S-P Map, enabling the model to reason about physical reachability without compromising its general vision-language capabilities. To train and evaluate PhysVLM, we constructed a large-scale multi-robot dataset, Phys100K, and a challenging benchmark, EQA-phys, which includes tasks for six different robots in both simulated and real-world environments. Experimental results demonstrate that PhysVLM outperforms existing models, achieving a 14% improvement over GPT-4o on EQA-phys and surpassing advanced embodied VLMs such as RoboMamba and SpatialVLM on the RoboVQA-val and OpenEQA benchmarks. Additionally, the S-P Map shows strong compatibility with various VLMs, and its integration into GPT-4o-mini yields a 7.1% performance improvement.
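
The abstract does not say how the S-P Map is computed, so the following is only an illustrative sketch of the general idea of a robot-independent reachability map. It assumes each image pixel has a known 3D position in the robot base frame and approximates the workspace as a spherical shell between two radii. The function name `reachability_mask` and the parameters `min_reach` and `max_reach` are hypothetical and are not taken from the paper.

```python
import math


def reachability_mask(points, min_reach, max_reach):
    """Return a 2D list of 0/1 flags, one per pixel.

    points: 2D grid (rows x cols) of (x, y, z) coordinates in the robot
            base frame, or None where no depth is available.
    min_reach, max_reach: assumed workspace radii in metres (hypothetical
            simplification; a real robot's workspace is not a perfect shell).
    """
    mask = []
    for row in points:
        mask_row = []
        for p in row:
            if p is None:
                # No 3D information for this pixel: treat as unreachable.
                mask_row.append(0)
                continue
            dist = math.sqrt(p[0] ** 2 + p[1] ** 2 + p[2] ** 2)
            mask_row.append(1 if min_reach <= dist <= max_reach else 0)
        mask.append(mask_row)
    return mask


# Tiny 2x2 "image" of 3D points, purely made-up values.
grid = [
    [(0.3, 0.0, 0.2), (0.9, 0.4, 0.1)],
    [None, (0.05, 0.0, 0.02)],
]
mask = reachability_mask(grid, min_reach=0.1, max_reach=0.8)
print(mask)
```

The output has the same shape as the image whatever robot produced it; only the radii change from robot to robot. That is the sense in which such a map could hide robot-specific parameters from a downstream model.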

arXiv information

Authors	Weijie Zhou, Manli Tao, Chaoyang Zhao, Haiyun Guo, Honghui Dong, Ming Tang, Jinqiao Wang
Published	2025-03-11 14:34:41+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
