Do Visual-Language Grid Maps Capture Latent Semantics?

Abstract

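Visual-language models (VLMs) have recently been introduced in robotic mapping, where the VLMs' latent representations, i.e., embeddings, are used to represent semantics in the map.
They make it possible to move from a limited set of human-created labels toward open-vocabulary scene understanding, which is very useful for robots operating in complex real-world environments and interacting with humans.
Although there is anecdotal evidence that maps built this way support downstream tasks such as navigation, a rigorous analysis of the quality of maps that use these embeddings is missing.
In this paper, we propose a way to analyze the quality of maps created using VLMs.
We investigate two critical properties of map quality: queryability and distinctness.
The evaluation of queryability addresses the ability to retrieve information from the embeddings.
We investigate intra-map distinctness to study the ability of the embeddings to represent abstract semantic classes, and inter-map distinctness to evaluate the generalization properties of the representation.
We propose metrics to evaluate these properties and use them to evaluate two state-of-the-art mapping methods, VLMaps and OpenScene, with two encoders, LSeg and OpenSeg, on real-world data from the Matterport3D data set.
Our findings show that while 3D features improve queryability, they are not scale invariant, whereas image-based embeddings generalize to multiple map resolutions.
This allows image-based methods to maintain smaller map sizes, which can be crucial for using these methods in real-world deployments.
Furthermore, we show that the choice of encoder affects the results.
The results imply that properly thresholding open-vocabulary queries is an open problem.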

Abstract (original)

Visual-language models (VLMs) have recently been introduced in robotic mapping using the latent representations, i.e., embeddings, of the VLMs to represent semantics in the map. They allow moving from a limited set of human-created labels toward open-vocabulary scene understanding, which is very useful for robots when operating in complex real-world environments and interacting with humans. While there is anecdotal evidence that maps built this way support downstream tasks, such as navigation, rigorous analysis of the quality of the maps using these embeddings is missing. In this paper, we propose a way to analyze the quality of maps created using VLMs. We investigate two critical properties of map quality: queryability and distinctness. The evaluation of queryability addresses the ability to retrieve information from the embeddings. We investigate intra-map distinctness to study the ability of the embeddings to represent abstract semantic classes and inter-map distinctness to evaluate the generalization properties of the representation. We propose metrics to evaluate these properties and evaluate two state-of-the-art mapping methods, VLMaps and OpenScene, using two encoders, LSeg and OpenSeg, using real-world data from the Matterport3D data set. Our findings show that while 3D features improve queryability, they are not scale invariant, whereas image-based embeddings generalize to multiple map resolutions. This allows the image-based methods to maintain smaller map sizes, which can be crucial for using these methods in real-world deployments. Furthermore, we show that the choice of the encoder has an effect on the results. The results imply that properly thresholding open-vocabulary queries is an open problem.
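
As a rough illustration of what querying such a map and thresholding the result can look like, the following is a minimal sketch, not the paper's method or metric. It assumes a grid map stored as a NumPy array with one embedding per cell and a text query already encoded into the same embedding space; the function name `query_grid_map`, the array layout, and the toy vectors are all hypothetical.

```python
import numpy as np


def query_grid_map(cell_embeddings, text_embedding, threshold):
    """Score every grid cell against a text query by cosine similarity.

    cell_embeddings: (H, W, D) array, one embedding per cell (assumed layout).
    text_embedding:  (D,) array, the encoded query (assumed to share the space).
    Returns the (H, W) similarity map and the boolean mask similarity > threshold.
    """
    # Normalize to unit length; the small floor avoids division by zero
    # for empty cells whose embedding is all zeros.
    cell_norms = np.linalg.norm(cell_embeddings, axis=-1, keepdims=True)
    cells = cell_embeddings / np.maximum(cell_norms, 1e-12)
    query = text_embedding / max(np.linalg.norm(text_embedding), 1e-12)

    similarity = cells @ query  # shape (H, W)
    return similarity, similarity > threshold


# Toy 2x2 map with 3-dimensional embeddings (illustrative values only).
grid = np.array(
    [
        [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]],
        [[1.0, 1.0, 0.0], [0.0, 0.0, 1.0]],
    ]
)
query = np.array([1.0, 0.0, 0.0])

similarity, loose_mask = query_grid_map(grid, query, threshold=0.5)
_, strict_mask = query_grid_map(grid, query, threshold=0.8)
print(similarity)
print(loose_mask)
print(strict_mask)
```

In this toy grid, raising the threshold from 0.5 to 0.8 drops the cell whose embedding only partly matches the query, which hints at why choosing a threshold for open-vocabulary queries is not straightforward.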

arXiv information

Authors	Matti Pekkanen, Tsvetomila Mihaylova, Francesco Verdoja, Ville Kyrki
Published	2025-03-04 12:17:16+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
