VGLD: Visually-Guided Linguistic Disambiguation for Monocular Depth Scale Recovery

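Summary

We propose a robust method for monocular depth scale recovery.
Monocular depth estimation can be divided into two main directions: (1) relative depth estimation, which provides normalized or inverse depth without scale information, and (2) metric depth estimation, which recovers depth with absolute scale.
To obtain absolute scale information for practical downstream tasks, using textual information to recover the scale of a relative depth map is a highly promising approach.
However, because a single image can have multiple descriptions written from different perspectives or in different styles, different textual descriptions have been shown to significantly affect the scale recovery process.
To address this issue, our method, VGLD, stabilizes the influence of textual information by incorporating high-level semantic information from the corresponding image alongside the textual description.
This approach resolves textual ambiguities and robustly outputs a set of linear transformation parameters (scalars) that can be applied globally to the relative depth map, ultimately producing depth predictions with metric-scale accuracy.
We validate the method across several popular relative depth models (MiDaS, DepthAnything), using both indoor scenes (NYUv2) and outdoor scenes (KITTI).
Our results show that VGLD functions as a universal alignment module when trained on multiple datasets and achieves strong performance even in zero-shot scenarios.
Code is available at https://github.com/pakinwu/VGLD.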

Summary (original)

We propose a robust method for monocular depth scale recovery. Monocular depth estimation can be divided into two main directions: (1) relative depth estimation, which provides normalized or inverse depth without scale information, and (2) metric depth estimation, which involves recovering depth with absolute scale. To obtain absolute scale information for practical downstream tasks, utilizing textual information to recover the scale of a relative depth map is a highly promising approach. However, since a single image can have multiple descriptions from different perspectives or with varying styles, it has been shown that different textual descriptions can significantly affect the scale recovery process. To address this issue, our method, VGLD, stabilizes the influence of textual information by incorporating high-level semantic information from the corresponding image alongside the textual description. This approach resolves textual ambiguities and robustly outputs a set of linear transformation parameters (scalars) that can be globally applied to the relative depth map, ultimately generating depth predictions with metric-scale accuracy. We validate our method across several popular relative depth models (MiDaS, DepthAnything), using both indoor scenes (NYUv2) and outdoor scenes (KITTI). Our results demonstrate that VGLD functions as a universal alignment module when trained on multiple datasets, achieving strong performance even in zero-shot scenarios. Code is available at: https://github.com/pakinwu/VGLD.
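
To make the "linear transformation parameters (scalars) applied globally" idea concrete, here is a minimal sketch rather than the authors' implementation. It assumes the scalars are a single scale and a single shift, and that they act directly on relative depth values; the abstract does not spell out either point, and models that output inverse depth would need an inversion step that is not shown. The function name `apply_global_alignment` and all numbers are hypothetical.

```python
from typing import List

DepthMap = List[List[float]]


def apply_global_alignment(relative_depth: DepthMap, scale: float, shift: float) -> DepthMap:
    """Apply one (scale, shift) pair to every pixel: metric = scale * relative + shift.

    Assumption: the same two scalars are used for the whole image, which is
    what "globally applied" is taken to mean here.
    """
    return [[scale * d + shift for d in row] for row in relative_depth]


# Toy 2x3 relative depth map with values in [0, 1] (made-up numbers).
relative = [
    [0.0, 0.5, 1.0],
    [0.25, 0.75, 1.0],
]

# Hypothetical scalars standing in for the output of a scale-recovery module.
metric = apply_global_alignment(relative, scale=8.0, shift=0.5)
print(metric)  # [[0.5, 4.5, 8.5], [2.5, 6.5, 8.5]]
```

Because only two numbers are predicted per image, the relative ordering of pixels in the depth map is unchanged; only its scale and offset are adjusted.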

arXiv information

Authors	Bojin Wu, Jing Chen
Published	2025-05-06 03:06:28+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
