Towards Scalable Foundation Model for Multi-modal and Hyperspectral Geospatial Data

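Summary

Geospatial raster data, such as imagery collected by satellite-based imaging systems at different times and in different spectral bands, hold great potential for enabling a wide range of high-impact applications.
This potential stems from rich information that is spatially and temporally contextualized across multiple channels and sensing modalities.
Recent work has adapted existing self-supervised learning approaches to such geospatial data.
However, these approaches lack scalable model architectures, which leads to inflexibility and computational inefficiency as the number of channels and modalities grows.
To address these limitations, we introduce the Low-rank Efficient Spatial-Spectral Vision Transformer (LESS ViT) with three key innovations: i) the LESS Attention Block, which approximates high-dimensional spatial-spectral attention through the Kronecker product of low-dimensional spatial and spectral attention components;
ii) the Continuous Positional-Channel Embedding Layer, which preserves both the continuity and the physical characteristics of each spatial-spectral patch;
and iii) the Perception Field Mask, which exploits local spatial dependencies by constraining attention to neighboring patches.
To evaluate the proposed innovations, we construct GFM-Bench, which serves as a comprehensive benchmark for such geospatial raster data.
We pretrain LESS ViT using a Hyperspectral Masked Autoencoder framework with integrated positional and channel masking strategies.
Experimental results show that the proposed method achieves competitive performance against state-of-the-art multi-modal geospatial foundation models, while outperforming them on cross-satellite generalization tasks with higher computational efficiency.
The flexibility and extensibility of the framework make it a promising direction for future geospatial data analysis tasks that involve a wide range of modalities and channels.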
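
The Kronecker-product idea in innovation i) can be illustrated with a small NumPy sketch. This is not the authors' implementation: the function name `kronecker_attention`, the tensor shapes, the scaled dot-product scoring, and the use of separate spatial and spectral queries and keys are all assumptions made for illustration. The sketch only shows why factorizing attention this way is cheaper: the joint attention matrix over N spatial patches and C channels has (N·C)² entries, while the two factors have only N² + C² entries, and the product never has to be materialized.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def kronecker_attention(q_s, k_s, q_c, k_c, v):
    """Hypothetical factorized spatial-spectral attention.

    q_s, k_s: (N, d) spatial queries and keys (assumed inputs)
    q_c, k_c: (C, d) spectral queries and keys (assumed inputs)
    v:        (N, C, d) one value vector per spatial-spectral token
    """
    d = q_s.shape[-1]
    a_s = softmax(q_s @ k_s.T / np.sqrt(d))  # (N, N) spatial attention
    a_c = softmax(q_c @ k_c.T / np.sqrt(d))  # (C, C) spectral attention
    # Apply kron(a_s, a_c) to the values without building the (N*C, N*C) matrix.
    out = np.einsum("nm,ck,mkd->ncd", a_s, a_c, v)
    return out, a_s, a_c

rng = np.random.default_rng(0)
N, C, d = 6, 4, 8
q_s, k_s = rng.normal(size=(N, d)), rng.normal(size=(N, d))
q_c, k_c = rng.normal(size=(C, d)), rng.normal(size=(C, d))
v = rng.normal(size=(N, C, d))

out, a_s, a_c = kronecker_attention(q_s, k_s, q_c, k_c, v)

# Reference: materialize the full Kronecker attention and apply it directly.
full = np.kron(a_s, a_c)  # (N*C, N*C)
reference = (full @ v.reshape(N * C, d)).reshape(N, C, d)
```

Because both factors are row-stochastic, their Kronecker product is row-stochastic as well, so the factorized form still behaves like an attention distribution over all spatial-spectral tokens.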

Summary (original)

Geospatial raster data, such as that collected by satellite-based imaging systems at different times and spectral bands, hold immense potential for enabling a wide range of high-impact applications. This potential stems from the rich information that is spatially and temporally contextualized across multiple channels and sensing modalities. Recent work has adapted existing self-supervised learning approaches for such geospatial data. However, they fall short of scalable model architectures, leading to inflexibility and computational inefficiencies when faced with an increasing number of channels and modalities. To address these limitations, we introduce Low-rank Efficient Spatial-Spectral Vision Transformer with three key innovations: i) the LESS Attention Block that approximates high-dimensional spatial-spectral attention through Kronecker’s product of the low-dimensional spatial and spectral attention components; ii) the Continuous Positional-Channel Embedding Layer that preserves both the continuity and physical characteristics of each spatial-spectral patch; and iii) the Perception Field Mask that exploits local spatial dependencies by constraining attention to neighboring patches. To evaluate the proposed innovations, we construct GFM-Bench, which serves as a comprehensive benchmark for such geospatial raster data. We pretrain LESS ViT using a Hyperspectral Masked Autoencoder framework with integrated positional and channel masking strategies. Experimental results demonstrate that our proposed method achieves competitive performance against state-of-the-art multi-modal geospatial foundation models while outperforming them on cross-satellite generalization tasks with higher computational efficiency. The flexibility and extensibility of our framework make it a promising direction for future geospatial data analysis tasks that involve a wide range of modalities and channels.
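
Constraining attention to neighboring patches, as the Perception Field Mask is described as doing, can be sketched as a boolean mask over a patch grid. The abstract does not specify how the neighborhood is defined, so the square window (Chebyshev distance at most `radius`), the function names `perception_field_mask` and `masked_softmax`, and the grid size below are assumptions for illustration only.

```python
import numpy as np

def perception_field_mask(height, width, radius):
    """Boolean (H*W, H*W) mask; True where patch j lies within `radius`
    grid steps of patch i (square window, an assumed neighborhood)."""
    rows, cols = np.divmod(np.arange(height * width), width)
    dr = np.abs(rows[:, None] - rows[None, :])
    dc = np.abs(cols[:, None] - cols[None, :])
    return np.maximum(dr, dc) <= radius

def masked_softmax(scores, mask):
    # Scores outside the mask are set to -inf so they receive zero weight.
    scores = np.where(mask, scores, -np.inf)
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    return weights / weights.sum(axis=-1, keepdims=True)

H, W = 4, 4
mask = perception_field_mask(H, W, radius=1)

rng = np.random.default_rng(0)
scores = rng.normal(size=(H * W, H * W))  # stand-in for spatial attention logits
weights = masked_softmax(scores, mask)
```

Every patch is always inside its own window, so each row of the mask has at least one True entry and the softmax stays well defined.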

arXiv information

Authors	Haozhe Si, Yuxuan Wan, Minh Do, Deepak Vasisht, Han Zhao, Hendrik F. Hamann
Published	2025-03-26 16:15:55+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
