SkySense: A Multi-Modal Remote Sensing Foundation Model Towards Universal Interpretation for Earth Observation Imagery

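Summary

Prior work on Remote Sensing Foundation Models (RSFMs) has shown great potential for a generic model for Earth observation.
However, these studies focus mainly on a single modality and do not model temporal or geographic context, which limits their capability on diverse tasks.
This work presents SkySense, a generic billion-scale model pre-trained on a curated multi-modal Remote Sensing Imagery (RSI) dataset containing 21.5 million temporal sequences.
SkySense incorporates a factorized multi-modal spatiotemporal encoder that takes temporal sequences of optical and Synthetic Aperture Radar (SAR) data as input.
The encoder is pre-trained with the authors' proposed Multi-Granularity Contrastive Learning to learn representations across different modal and spatial granularities.
To further enhance the RSI representations with geo-context clues, the authors introduce Geo-Context Prototype Learning, which learns region-aware prototypes on top of the multi-modal spatiotemporal features of the RSI.
To the authors' knowledge, SkySense is the largest multi-modal RSFM to date, and its modules can be flexibly combined or used individually to suit different tasks.
In a thorough evaluation covering 16 datasets across 7 tasks, from single-modal to multi-modal, static to temporal, and classification to localization, it demonstrates remarkable generalization.
SkySense surpasses 18 recent RSFMs in all test scenarios.
Specifically, it outperforms recent models such as GFM, SatLas, and Scale-MAE by a large margin: 2.76%, 3.67%, and 3.61% on average, respectively.
The authors will release the pre-trained weights to facilitate future research and Earth observation applications.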
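
The abstract names contrastive pre-training across modalities but does not spell out the loss. As a rough illustration only, the sketch below shows a generic InfoNCE-style contrastive loss between paired optical and SAR embeddings. It is an assumption about the general family of objectives, not the SkySense loss itself; the function names (`l2_normalize`, `info_nce`), the temperature value, and the toy embeddings are all hypothetical.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def info_nce(anchors, positives, temperature=0.1):
    """Mean InfoNCE loss: anchors[i] should match positives[i] and no other row.

    Generic contrastive objective, used here as a stand-in; the paper's
    multi-granularity formulation is not given in the abstract.
    """
    a = [l2_normalize(v) for v in anchors]
    p = [l2_normalize(v) for v in positives]
    total = 0.0
    for i, ai in enumerate(a):
        # Similarity of anchor i to every candidate, scaled by the temperature.
        logits = [sum(x * y for x, y in zip(ai, pj)) / temperature for pj in p]
        # Numerically stable log-sum-exp over all candidates.
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        # Cross-entropy with the matching pair as the correct class.
        total += log_denom - logits[i]
    return total / len(a)

# Hypothetical toy embeddings: two scenes, each seen by an optical and a SAR sensor.
optical = [[1.0, 0.0], [0.0, 1.0]]
sar_aligned = [[0.9, 0.1], [0.1, 0.9]]   # SAR features agree with their optical pair
sar_swapped = [[0.1, 0.9], [0.9, 0.1]]   # SAR features agree with the wrong scene

loss_aligned = info_nce(optical, sar_aligned)
loss_swapped = info_nce(optical, sar_swapped)
print(loss_aligned < loss_swapped)
```

The loss is lower when each optical embedding is closest to the SAR embedding of the same scene. One plausible reading of "multi-granularity" is that a loss of this kind is applied to features pooled at more than one spatial scale, but that reading is an assumption rather than something the abstract states.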

Summary (original)

Prior studies on Remote Sensing Foundation Model (RSFM) reveal immense potential towards a generic model for Earth Observation. Nevertheless, these works primarily focus on a single modality without temporal and geo-context modeling, hampering their capabilities for diverse tasks. In this study, we present SkySense, a generic billion-scale model, pre-trained on a curated multi-modal Remote Sensing Imagery (RSI) dataset with 21.5 million temporal sequences. SkySense incorporates a factorized multi-modal spatiotemporal encoder taking temporal sequences of optical and Synthetic Aperture Radar (SAR) data as input. This encoder is pre-trained by our proposed Multi-Granularity Contrastive Learning to learn representations across different modal and spatial granularities. To further enhance the RSI representations by the geo-context clue, we introduce Geo-Context Prototype Learning to learn region-aware prototypes upon RSI’s multi-modal spatiotemporal features. To our best knowledge, SkySense is the largest Multi-Modal RSFM to date, whose modules can be flexibly combined or used individually to accommodate various tasks. It demonstrates remarkable generalization capabilities on a thorough evaluation encompassing 16 datasets over 7 tasks, from single- to multi-modal, static to temporal, and classification to localization. SkySense surpasses 18 recent RSFMs in all test scenarios. Specifically, it outperforms the latest models such as GFM, SatLas and Scale-MAE by a large margin, i.e., 2.76%, 3.67% and 3.61% on average respectively. We will release the pre-trained weights to facilitate future research and Earth Observation applications.

arXiv information

Authors	Xin Guo, Jiangwei Lao, Bo Dang, Yingying Zhang, Lei Yu, Lixiang Ru, Liheng Zhong, Ziyuan Huang, Kang Wu, Dingxiang Hu, Huimei He, Jian Wang, Jingdong Chen, Ming Yang, Yongjun Zhang, Yansheng Li
Published	2024-03-22 16:46:36+00:00
arXiv page	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
