Visually-Guided Linguistic Disambiguation for Monocular Depth Scale Recovery

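Summary

This work proposes a robust method for recovering scale in monocular depth estimation. Monocular depth estimation falls into two broad directions: (1) relative depth estimation, which yields normalized or inverse depth without scale information, and (2) metric depth estimation, which recovers depth at absolute scale. Because practical downstream tasks need absolute scale, using textual information to recover the scale of a relative depth map is a highly promising approach. However, a single image can be described in many ways, from different viewpoints and in different styles, and these differences in the text have been shown to affect scale recovery significantly. To address this, the proposed method, VGLD, stabilizes the influence of the text by incorporating high-level semantic information from the corresponding image alongside the textual description. This resolves textual ambiguity and robustly outputs a set of linear transformation parameters (scalars) that can be applied globally to the relative depth map, ultimately producing depth predictions with metric-scale accuracy. The method is validated across several popular relative depth models (MiDaS, DepthAnything) on both indoor scenes (NYUv2) and outdoor scenes (KITTI). The results show that VGLD, when trained on multiple datasets, functions as a universal alignment module and achieves strong performance even in zero-shot scenarios. Code is available at https://github.com/pakinwu/VGLD.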

Abstract (original)

We propose a robust method for monocular depth scale recovery. Monocular depth estimation can be divided into two main directions: (1) relative depth estimation, which provides normalized or inverse depth without scale information, and (2) metric depth estimation, which involves recovering depth with absolute scale. To obtain absolute scale information for practical downstream tasks, utilizing textual information to recover the scale of a relative depth map is a highly promising approach. However, since a single image can have multiple descriptions from different perspectives or with varying styles, it has been shown that different textual descriptions can significantly affect the scale recovery process. To address this issue, our method, VGLD, stabilizes the influence of textual information by incorporating high-level semantic information from the corresponding image alongside the textual description. This approach resolves textual ambiguities and robustly outputs a set of linear transformation parameters (scalars) that can be globally applied to the relative depth map, ultimately generating depth predictions with metric-scale accuracy. We validate our method across several popular relative depth models (MiDaS, DepthAnything), using both indoor scenes (NYUv2) and outdoor scenes (KITTI). Our results demonstrate that VGLD functions as a universal alignment module when trained on multiple datasets, achieving strong performance even in zero-shot scenarios. Code is available at: https://github.com/pakinwu/VGLD.
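The step of applying globally predicted linear parameters to a relative depth map can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `to_metric_depth`, the parameter names `scale` and `shift`, and the optional inverse-depth handling are assumptions introduced here, and the abstract does not state in which domain (depth or inverse depth) VGLD applies the transform.

```python
import numpy as np


def to_metric_depth(relative, scale, shift, inverse=False, eps=1e-6):
    """Apply one global linear transform to a relative depth map.

    `scale` and `shift` are hypothetical names for the scalar parameters
    that a scale-recovery module would predict for the whole image.
    If `inverse` is True, the input is treated as relative inverse depth,
    so the aligned values are inverted to obtain depth (an assumption).
    """
    relative = np.asarray(relative, dtype=np.float64)
    aligned = scale * relative + shift  # same two scalars for every pixel
    if inverse:
        # Clip before inverting to avoid division by zero or negative depth.
        return 1.0 / np.clip(aligned, eps, None)
    return aligned


# Toy 2x2 relative depth map with values normalized to [0, 1].
relative = np.array([[0.0, 0.5], [0.75, 1.0]])
metric = to_metric_depth(relative, scale=8.0, shift=2.0)
print(metric)
```

Because the two scalars are shared by all pixels, the relative ordering of depths produced by the underlying model is preserved; only the global scale and offset change.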

arXiv information

Authors	Bojin Wu, Jing Chen
Published	2025-05-05 14:57:16+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, DeepL
