Advancing Semantic Future Prediction through Multimodal Visual Sequence Transformers

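Summary

Semantic future prediction is important for autonomous systems navigating dynamic environments.
This paper introduces FUTURIST, a multimodal future semantic prediction method that uses a unified, efficient visual sequence transformer architecture.
Our approach incorporates a multimodal masked visual modeling objective and a new masking mechanism designed for multimodal training.
This lets the model effectively integrate visible information from different modalities, improving prediction accuracy.
In addition, we propose a VAE-free hierarchical tokenization process that reduces computational complexity, streamlines the training pipeline, and enables end-to-end training with high-resolution multimodal inputs.
We validate FUTURIST on the Cityscapes dataset and demonstrate state-of-the-art performance in future semantic segmentation for both short-term and mid-term prediction.
The implementation code is available at https://github.com/Sta8is/FUTURIST.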

Summary (original)

Semantic future prediction is important for autonomous systems navigating dynamic environments. This paper introduces FUTURIST, a method for multimodal future semantic prediction that uses a unified and efficient visual sequence transformer architecture. Our approach incorporates a multimodal masked visual modeling objective and a novel masking mechanism designed for multimodal training. This allows the model to effectively integrate visible information from various modalities, improving prediction accuracy. Additionally, we propose a VAE-free hierarchical tokenization process, which reduces computational complexity, streamlines the training pipeline, and enables end-to-end training with high-resolution, multimodal inputs. We validate FUTURIST on the Cityscapes dataset, demonstrating state-of-the-art performance in future semantic segmentation for both short- and mid-term forecasting. We provide the implementation code at https://github.com/Sta8is/FUTURIST .
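
The abstract does not spell out FUTURIST's masking mechanism or loss, so the following is only a generic sketch of masked modeling over tokens from several modalities, not the authors' method. It assumes each modality has already been tokenized into the same number of feature vectors, draws an independent random mask per modality, and scores a prediction only at the hidden positions. The function names (`sample_masks`, `masked_l1_loss`), the L1 loss, the zero placeholder for hidden tokens, and all shapes are hypothetical choices made for illustration.

```python
import numpy as np


def sample_masks(num_modalities, num_tokens, mask_ratio, rng):
    """Draw one independent boolean mask per modality (True = hidden)."""
    num_masked = int(round(mask_ratio * num_tokens))
    masks = np.zeros((num_modalities, num_tokens), dtype=bool)
    for m in range(num_modalities):
        hidden = rng.choice(num_tokens, size=num_masked, replace=False)
        masks[m, hidden] = True
    return masks


def masked_l1_loss(pred, target, masks):
    """Mean absolute error averaged over hidden positions only."""
    per_token = np.abs(pred - target).mean(axis=-1)  # shape (M, N)
    return float(per_token[masks].mean())


def demo(seed=0):
    """Toy run: 2 modalities, 16 tokens each, 8 feature dimensions."""
    rng = np.random.default_rng(seed)
    tokens = rng.normal(size=(2, 16, 8))
    masks = sample_masks(2, 16, 0.5, rng)
    # Hidden tokens are replaced by zeros (assumed placeholder); a real
    # model would typically use a learned mask embedding instead.
    visible = np.where(masks[..., None], 0.0, tokens)
    # Stand-in "model": returns its input unchanged, so the loss simply
    # measures how much information the mask removed.
    pred = visible
    return tokens, masks, masked_l1_loss(pred, tokens, masks)
```

Because the masks are drawn per modality, a position hidden in one modality can remain visible in another, which is what lets a model draw on cross-modal context in this kind of setup. Whether FUTURIST samples its masks this way is not stated in the abstract.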

arXiv information

Authors	Efstathios Karypidis, Ioannis Kakogeorgiou, Spyros Gidaris, Nikos Komodakis
Published	2025-01-14 18:34:14+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
