CityLoc: 6 DoF Localization of Text Descriptions in Large-Scale Scenes with Gaussian Representation

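Abstract

Localizing a text description in a large-scale 3D scene is inherently ambiguous.
This ambiguity nonetheless arises naturally when a description refers to a general concept, for example all traffic lights in a city.
Reasoning over such concepts requires localizing the text as a distribution rather than as a single pose.
In this paper, we generate a distribution of camera poses conditioned on the text description.
To enable this, we propose a diffusion-based architecture that conditionally diffuses noisy 6DoF camera poses toward plausible locations.
The conditioning signal is derived from the text description using a pre-trained text encoder.
The link between text descriptions and the pose distribution is established through a pre-trained vision-language model, namely CLIP.
We further show that the candidate poses in the distribution can be refined by rendering them with 3D Gaussian splatting and, through visual reasoning, guiding misplaced samples toward locations that better match the text description.
We demonstrate the effectiveness of our method by comparing it with both standard retrieval methods and learning-based approaches.
Our method consistently outperforms these baselines on all five large-scale datasets.
Our source code and dataset will be made publicly available.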
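
As a rough illustration of what "conditionally diffusing noisy 6DoF poses toward plausible locations" can look like, the following is a minimal sketch of DDPM-style ancestral sampling over 6-dimensional pose vectors. It is not the authors' implementation. Everything named here is an assumption made for illustration: the `TOY_MODES` table, the linear noise schedule, the step count `T`, and above all `toy_noise_predictor`, which stands in for the learned, text-conditioned denoiser by simply assuming the clean pose is the nearest hard-coded mode. The CLIP conditioning and the Gaussian-splatting refinement described in the abstract are not modeled.

```python
import math
import random

# Hypothetical toy "scene": text query -> plausible 6DoF poses
# (x, y, z, yaw, pitch, roll). In the paper this relation is learned;
# here it is hard-coded purely for illustration.
TOY_MODES = {
    "traffic light": [
        (10.0, 2.0, 5.0, 0.0, 0.1, 0.0),
        (-8.0, 2.0, 12.0, 1.5, 0.1, 0.0),
    ],
}

T = 50  # number of diffusion steps (assumed)
BETAS = [1e-4 + (0.2 - 1e-4) * t / (T - 1) for t in range(T)]  # linear schedule (assumed)
ALPHAS = [1.0 - b for b in BETAS]
ALPHA_BARS = []
_prod = 1.0
for _a in ALPHAS:
    _prod *= _a
    ALPHA_BARS.append(_prod)  # cumulative product of alphas


def toy_noise_predictor(x_t, t, text):
    """Stand-in for a learned, text-conditioned denoiser (assumption).

    Assumes the clean pose is the mode closest to the current noisy sample
    and returns the noise that would explain x_t under that assumption.
    """
    ab = ALPHA_BARS[t]
    scale = math.sqrt(ab)

    def dist2(mode):
        return sum((xi - scale * mi) ** 2 for xi, mi in zip(x_t, mode))

    nearest = min(TOY_MODES[text], key=dist2)
    return [(xi - scale * mi) / math.sqrt(1.0 - ab) for xi, mi in zip(x_t, nearest)]


def sample_pose(text, rng):
    """Draw one 6DoF pose by running the reverse diffusion from pure noise."""
    x = [rng.gauss(0.0, 1.0) for _ in range(6)]
    for t in reversed(range(T)):
        eps = toy_noise_predictor(x, t, text)
        coef = BETAS[t] / math.sqrt(1.0 - ALPHA_BARS[t])
        mean = [(xi - coef * ei) / math.sqrt(ALPHAS[t]) for xi, ei in zip(x, eps)]
        sigma = math.sqrt(BETAS[t]) if t > 0 else 0.0  # no noise on the last step
        x = [mi + sigma * rng.gauss(0.0, 1.0) for mi in mean]
    return x


def sample_pose_distribution(text, n=8, seed=0):
    """Draw n poses; together they approximate the pose distribution for `text`."""
    rng = random.Random(seed)
    return [sample_pose(text, rng) for _ in range(n)]


if __name__ == "__main__":
    for pose in sample_pose_distribution("traffic light", n=4):
        print([round(v, 3) for v in pose])
```

Because the stand-in predictor is an oracle over a fixed set of modes, every sample ends on one of the hard-coded poses. A learned denoiser would instead produce a spread of poses whose density reflects how well each location matches the text.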

Abstract (original)

Localizing text descriptions in large-scale 3D scenes is inherently an ambiguous task. This nonetheless arises while describing general concepts, e.g. all traffic lights in a city. To facilitate reasoning based on such concepts, text localization in the form of distribution is required. In this paper, we generate the distribution of the camera poses conditioned upon the textual description. To facilitate such generation, we propose a diffusion-based architecture that conditionally diffuses the noisy 6DoF camera poses to their plausible locations. The conditional signals are derived from the text descriptions, using the pre-trained text encoders. The connection between text descriptions and pose distribution is established through pretrained Vision-Language-Model, i.e. CLIP. Furthermore, we demonstrate that the candidate poses for the distribution can be further refined by rendering potential poses using 3D Gaussian splatting, guiding incorrectly posed samples towards locations that better align with the textual description, through visual reasoning. We demonstrate the effectiveness of our method by comparing it with both standard retrieval methods and learning-based approaches. Our proposed method consistently outperforms these baselines across all five large-scale datasets. Our source code and dataset will be made publicly available.

arXiv information

Authors	Qi Ma, Runyi Yang, Bin Ren, Ender Konukoglu, Luc Van Gool, Danda Pani Paudel
Published	2025-01-15 17:59:32+00:00
arXiv page	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
