Multi-Modal 3D Scene Graph Updater for Shared and Dynamic Environments

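Abstract

The emergence of general-purpose Large Language Models (LLMs) and Large Vision Models (VLMs) has streamlined the construction of semantically enriched maps, allowing robots to ground high-level reasoning and planning in those representations.
One of the most widely used semantic map formats is the 3D scene graph, which captures both metric (low-level) and semantic (high-level) information.
However, these maps often assume a static world, whereas real environments such as homes and offices are dynamic.
Even small changes in these spaces can significantly affect task performance.
To be integrated into dynamic environments, robots must detect changes and update the scene graph in real time.
This update process is inherently multimodal, requiring input from a variety of sources, including human agents, the robot's own perception system, time, and the robot's actions.
This work proposes a framework that uses these multimodal inputs to keep scene graphs consistent during real-time operation, presents promising initial results, and outlines a roadmap for future research.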

Abstract (original)

The advent of generalist Large Language Models (LLMs) and Large Vision Models (VLMs) has streamlined the construction of semantically enriched maps that can enable robots to ground high-level reasoning and planning into their representations. One of the most widely used semantic map formats is the 3D Scene Graph, which captures both metric (low-level) and semantic (high-level) information. However, these maps often assume a static world, while real environments, like homes and offices, are dynamic. Even small changes in these spaces can significantly impact task performance. To integrate robots into dynamic environments, they must detect changes and update the scene graph in real-time. This update process is inherently multimodal, requiring input from various sources, such as human agents, the robot’s own perception system, time, and its actions. This work proposes a framework that leverages these multimodal inputs to maintain the consistency of scene graphs during real-time operation, presenting promising initial results and outlining a roadmap for future research.
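
To make the idea of a scene graph that stays consistent under multimodal updates more concrete, here is a minimal sketch. It is not the authors' framework: the abstract gives no implementation details, so every name below (`Source`, `ObjectNode`, `SceneGraph`, `update`, `age`, `max_age`) is hypothetical, and treating "time" as a simple staleness check is an assumption made only for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from enum import Enum
from typing import Optional


class Source(Enum):
    """The four update sources named in the abstract (enum itself is hypothetical)."""
    HUMAN = "human"
    PERCEPTION = "perception"
    TIME = "time"
    ACTION = "action"


@dataclass
class ObjectNode:
    """One object: metric (low-level) position plus semantic (high-level) label."""
    label: str
    position: tuple[float, float, float]
    last_seen: float = 0.0
    last_source: Optional[Source] = None
    stale: bool = False


@dataclass
class SceneGraph:
    nodes: dict[str, ObjectNode] = field(default_factory=dict)
    # Semantic edges stored as: child id -> (relation, parent id).
    edges: dict[str, tuple[str, str]] = field(default_factory=dict)

    def add(self, node_id, node, relation, parent):
        self.nodes[node_id] = node
        self.edges[node_id] = (relation, parent)

    def update(self, source, node_id, now, position, relation, parent):
        """Apply a change reported by a human, perception, or the robot's own action."""
        node = self.nodes[node_id]
        node.position = position
        node.last_seen = now
        node.last_source = source
        node.stale = False
        self.edges[node_id] = (relation, parent)

    def age(self, now, max_age):
        """Time as an input (assumed policy): flag nodes not confirmed within max_age."""
        stale_ids = []
        for node_id, node in self.nodes.items():
            if now - node.last_seen > max_age:
                node.stale = True
                stale_ids.append(node_id)
        return stale_ids


graph = SceneGraph()
graph.add("mug", ObjectNode("mug", (1.0, 0.5, 0.8)), "on", "table")
graph.add("book", ObjectNode("book", (2.0, 0.0, 1.2)), "on", "shelf")

# A human tells the robot that the mug was moved to the counter.
graph.update(Source.HUMAN, "mug", now=100.0,
             position=(4.0, 2.0, 0.9), relation="on", parent="counter")

# Later, anything not confirmed in the last 60 seconds is marked for re-checking.
stale = graph.age(now=130.0, max_age=60.0)
```

In this sketch the mug stays fresh because the human report refreshed it, while the book is flagged as stale and would need to be re-observed. A real system would also have to resolve conflicting reports between sources, which this example does not attempt.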

arXiv information

Authors	Emilio Olivastri, Jonathan Francis, Alberto Pretto, Niko Sünderhauf, Krishan Rana
Published	2024-11-05 09:31:30+00:00
arXiv page	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
