Unified Human Localization and Trajectory Prediction with Monocular Vision

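Summary

Conventional human trajectory prediction models rely on clean, curated data, which requires specialized equipment or manual labeling and is often impractical for robotic applications.
Existing predictors tend to overfit to clean observations, which hurts their robustness when they are used with noisy inputs.
In this work, we propose MonoTransmotion (MT), a Transformer-based framework that uses only a monocular camera to jointly solve the localization and prediction tasks.
Our framework has two main modules: Bird's Eye View (BEV) localization and trajectory prediction.
The BEV localization module estimates a person's position from 2D human poses and is enhanced by a novel directional loss that yields smoother sequential localizations.
The trajectory prediction module predicts future motion from these estimates.
We show that jointly training both tasks within the unified framework makes our method more robust in real-world scenarios with noisy inputs.
We validate the MT network on both curated and non-curated datasets.
On the curated dataset, MT achieves around a 12% improvement over baseline models on BEV localization and trajectory prediction.
On a real-world non-curated dataset, the experimental results indicate that MT maintains similar performance levels, highlighting its robustness and generalization capability.
The code is available at https://github.com/vita-epfl/MonoTransmotion.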
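
The summary names a directional loss but does not define it. As a rough illustration only, one plausible form compares the direction of each frame-to-frame step in the predicted BEV track against the ground-truth step, using cosine similarity. The function name `directional_loss`, the list-of-(x, y) input format, and the cosine formulation are all assumptions made for this sketch, not the authors' definition; the linked repository is the place to check the actual implementation.

```python
import math


def directional_loss(pred, target, eps=1e-8):
    """Hypothetical directional loss (an assumption, not the paper's formula).

    pred, target: sequences of (x, y) BEV positions, one per frame.
    Returns the mean of (1 - cosine similarity) between predicted and
    ground-truth step directions: about 0 when aligned, about 2 when opposite.
    """
    if len(pred) != len(target) or len(pred) < 2:
        raise ValueError("need two tracks of equal length with at least 2 frames")

    def steps(points):
        # Displacement between each pair of consecutive frames.
        return [(x1 - x0, y1 - y0) for (x0, y0), (x1, y1) in zip(points, points[1:])]

    total = 0.0
    pairs = list(zip(steps(pred), steps(target)))
    for (px, py), (tx, ty) in pairs:
        dot = px * tx + py * ty
        norm = math.hypot(px, py) * math.hypot(tx, ty)
        # eps guards against division by zero for zero-length steps.
        total += 1.0 - dot / (norm + eps)
    return total / len(pairs)


if __name__ == "__main__":
    truth = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
    sideways = [(0.0, 0.0), (0.0, 1.0), (0.0, 2.0)]
    print(directional_loss(truth, truth))     # close to 0: same heading
    print(directional_loss(sideways, truth))  # close to 1: perpendicular heading
```

A term of this kind depends only on heading, not on position error, so in practice it would be added to an ordinary position loss rather than used alone.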

Summary (original)

Conventional human trajectory prediction models rely on clean curated data, requiring specialized equipment or manual labeling, which is often impractical for robotic applications. The existing predictors tend to overfit to clean observation affecting their robustness when used with noisy inputs. In this work, we propose MonoTransmotion (MT), a Transformer-based framework that uses only a monocular camera to jointly solve localization and prediction tasks. Our framework has two main modules: Bird’s Eye View (BEV) localization and trajectory prediction. The BEV localization module estimates the position of a person using 2D human poses, enhanced by a novel directional loss for smoother sequential localizations. The trajectory prediction module predicts future motion from these estimates. We show that by jointly training both tasks with our unified framework, our method is more robust in real-world scenarios made of noisy inputs. We validate our MT network on both curated and non-curated datasets. On the curated dataset, MT achieves around 12% improvement over baseline models on BEV localization and trajectory prediction. On real-world non-curated dataset, experimental results indicate that MT maintains similar performance levels, highlighting its robustness and generalization capability. The code is available at https://github.com/vita-epfl/MonoTransmotion.

arXiv information

Authors	Po-Chien Luan,Yang Gao,Celine Demonsant,Alexandre Alahi
Published	2025-03-05 14:18:39+00:00
arXiv page	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
