Spatiotemporal Contrastive Learning for Cross-View Video Localization in Unstructured Off-road Terrains

Abstract

Robust cross-view 3-DoF localization in GPS-denied, off-road environments remains challenging due to (1) perceptual ambiguities from repetitive vegetation and unstructured terrain, and (2) seasonal shifts that significantly alter scene appearance, hindering alignment with outdated satellite imagery. To address this, we introduce MoViX, a self-supervised cross-view video localization framework that learns viewpoint- and season-invariant representations while preserving directional awareness essential for accurate localization. MoViX employs a pose-dependent positive sampling strategy to enhance directional discrimination and temporally aligned hard negative mining to discourage shortcut learning from seasonal cues. A motion-informed frame sampler selects spatially diverse frames, and a lightweight temporal aggregator emphasizes geometrically aligned observations while downweighting ambiguous ones. At inference, MoViX runs within a Monte Carlo Localization framework, using a learned cross-view matching module in place of handcrafted models. Entropy-guided temperature scaling enables robust multi-hypothesis tracking and confident convergence under visual ambiguity. We evaluate MoViX on the TartanDrive 2.0 dataset, training on under 30 minutes of data and testing over 12.29 km. Despite outdated satellite imagery, MoViX localizes within 25 meters of ground truth 93% of the time, and within 50 meters 100% of the time in unseen regions, outperforming state-of-the-art baselines without environment-specific tuning. We further demonstrate generalization on a real-world off-road dataset from a geographically distinct site with a different robot platform.
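
The abstract does not spell out how entropy-guided temperature scaling is computed, so the following is only a minimal sketch of one plausible reading, not the authors' implementation. It assumes each particle in the Monte Carlo Localization filter receives a cross-view similarity score, and that the softmax temperature is interpolated between two bounds according to the normalized entropy of the score distribution: ambiguous (high-entropy) scores get a high temperature that keeps several hypotheses alive, while peaked scores get a low temperature that sharpens convergence. The function names and the t_min and t_max parameters are hypothetical.

```python
import math


def softmax(scores, temperature=1.0):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp((s - m) / temperature) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def normalized_entropy(probs):
    """Shannon entropy divided by log(n), so the result lies in [0, 1]."""
    n = len(probs)
    if n < 2:
        return 0.0
    h = -sum(p * math.log(p) for p in probs if p > 0.0)
    return h / math.log(n)


def entropy_guided_weights(scores, t_min=0.1, t_max=1.0):
    """Turn per-particle similarity scores into particle weights.

    Assumption (not from the paper): the temperature is a linear
    interpolation between t_min and t_max, driven by the normalized
    entropy of the scores at temperature 1.
    """
    base = softmax(scores, temperature=1.0)
    h = normalized_entropy(base)
    temperature = t_min + (t_max - t_min) * h
    return softmax(scores, temperature), temperature


# Ambiguous scene: all particles match the satellite map equally well.
flat_weights, flat_t = entropy_guided_weights([0.5, 0.5, 0.5, 0.5])

# Distinctive scene: one particle matches far better than the rest.
peaked_weights, peaked_t = entropy_guided_weights([10.0, 0.0, 0.0, 0.0])
```

Under these assumptions, the ambiguous case yields a temperature near t_max and uniform weights, so resampling preserves multiple hypotheses, while the distinctive case yields a temperature near t_min and concentrates almost all weight on the best-matching particle.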

arXiv information

Authors	Zhiyun Deng, Dongmyeong Lee, Amanda Adkins, Jesse Quattrociocchi, Christian Ellis, Joydeep Biswas
Published	2025-06-05 17:10:29+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
