PAVLM: Advancing Point Cloud based Affordance Understanding Via Vision-Language Model

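Abstract

Affordance understanding, the task of identifying actionable regions on 3D objects, plays an important role in enabling robotic systems to interact with and operate in the physical world.
Vision-language models (VLMs) excel at high-level reasoning and long-horizon planning for robotic manipulation, but they still fall short of grasping the subtle physical properties needed for effective human-robot interaction.
This paper introduces PAVLM (Point cloud Affordance Vision-Language Model), a framework that uses the extensive multimodal knowledge embedded in pre-trained language models to improve 3D affordance understanding of point clouds.
PAVLM integrates a geometric-guided propagation module with hidden embeddings from large language models (LLMs) to enrich visual semantics.
On the language side, Llama-3.1 models are prompted to generate refined, context-aware text, which augments the instruction input with deeper semantic cues.
Experimental results on the 3D-AffordanceNet benchmark show that PAVLM outperforms baseline methods on both full and partial point clouds, and generalizes particularly well to novel open-world affordance tasks on 3D objects.
For details, see the project site: pavlm-source.github.io.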
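
To make "identifying actionable regions on 3D objects" concrete, here is a minimal sketch of one common way to represent the task: each point in a cloud gets an affordance score, and the scored region is compared with a labelled region. This is not PAVLM's implementation. The abstract does not specify the output format or the evaluation metric, so the 0.5 threshold, the IoU measure, the function names, and the toy data are all assumptions made for illustration.

```python
from typing import List, Sequence


def affordance_mask(scores: Sequence[float], threshold: float = 0.5) -> List[bool]:
    """Mark each point whose affordance score reaches the (assumed) threshold."""
    return [s >= threshold for s in scores]


def mask_iou(pred: Sequence[bool], target: Sequence[bool]) -> float:
    """Intersection over union of two per-point boolean masks."""
    if len(pred) != len(target):
        raise ValueError("masks must cover the same points")
    intersection = sum(p and t for p, t in zip(pred, target))
    union = sum(p or t for p, t in zip(pred, target))
    # Two empty masks agree perfectly, so treat that case as IoU = 1.
    return intersection / union if union else 1.0


# Toy point cloud: five (x, y, z) points, e.g. sampled from a mug.
points = [
    (0.0, 0.0, 0.0),
    (0.1, 0.0, 0.0),
    (0.0, 0.1, 0.0),
    (0.0, 0.0, 0.1),
    (0.1, 0.1, 0.0),
]

# Hypothetical per-point scores for a single affordance such as "grasp".
scores = [0.9, 0.7, 0.2, 0.1, 0.6]

# Hypothetical ground-truth region for the same affordance.
ground_truth = [True, True, False, False, False]

predicted = affordance_mask(scores)
iou = mask_iou(predicted, ground_truth)
```

In this sketch, a partial point cloud would simply be a shorter `points` list with matching `scores`; the comparison itself does not change.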

Abstract (original)

Affordance understanding, the task of identifying actionable regions on 3D objects, plays a vital role in allowing robotic systems to engage with and operate within the physical world. Although Visual Language Models (VLMs) have excelled in high-level reasoning and long-horizon planning for robotic manipulation, they still fall short in grasping the nuanced physical properties required for effective human-robot interaction. In this paper, we introduce PAVLM (Point cloud Affordance Vision-Language Model), an innovative framework that utilizes the extensive multimodal knowledge embedded in pre-trained language models to enhance 3D affordance understanding of point cloud. PAVLM integrates a geometric-guided propagation module with hidden embeddings from large language models (LLMs) to enrich visual semantics. On the language side, we prompt Llama-3.1 models to generate refined context-aware text, augmenting the instructional input with deeper semantic cues. Experimental results on the 3D-AffordanceNet benchmark demonstrate that PAVLM outperforms baseline methods for both full and partial point clouds, particularly excelling in its generalization to novel open-world affordance tasks of 3D objects. For more information, visit our project site: pavlm-source.github.io.

arXiv information

Authors	Shang-Ching Liu, Van Nhiem Tran, Wenkai Chen, Wei-Lun Cheng, Yen-Lin Huang, I-Bin Liao, Yung-Hui Li, Jianwei Zhang
Published	2024-10-15 12:53:42+00:00
arXiv page	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google
