Beyond the Visible: Multispectral Vision-Language Learning for Earth Observation

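Summary

Vision-language models for Earth observation (EO) typically rely on the visible spectrum as their only model input, and so fail to exploit the rich spectral information available in the multispectral channels recorded by satellites.
We therefore introduce Llama3-MS-CLIP, the first vision-language model pre-trained with contrastive learning on a large-scale multispectral dataset, and report the performance gains obtained from the extended spectral range.
We also present the largest image-caption dataset for multispectral data to date, consisting of one million Sentinel-2 samples with corresponding textual descriptions generated using Llama3-LLaVA-Next and Overture Maps data.
We develop a scalable captioning pipeline that is validated by domain experts.
We evaluate Llama3-MS-CLIP on multispectral zero-shot image classification and retrieval using three datasets of varying complexity.
Our results show that Llama3-MS-CLIP significantly outperforms other RGB-based approaches, improving classification accuracy by +6.77% on average and retrieval performance by +4.63% mAP compared to the second-best model.
Our results emphasize the relevance of multispectral vision-language learning.
The image-caption dataset, code, and model weights are available at https://github.com/IBM/MS-CLIP.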

Summary (original)

Vision-language models for Earth observation (EO) typically rely on the visual spectrum of data as the only model input, thus failing to leverage the rich spectral information available in the multispectral channels recorded by satellites. Therefore, we introduce Llama3-MS-CLIP, the first vision-language model pre-trained with contrastive learning on a large-scale multispectral dataset and report on the performance gains due to the extended spectral range. Furthermore, we present the largest-to-date image-caption dataset for multispectral data, consisting of one million Sentinel-2 samples and corresponding textual descriptions generated using Llama3-LLaVA-Next and Overture Maps data. We develop a scalable captioning pipeline, which is validated by domain experts. We evaluate Llama3-MS-CLIP on multispectral zero-shot image classification and retrieval using three datasets of varying complexity. Our results demonstrate that Llama3-MS-CLIP significantly outperforms other RGB-based approaches, improving classification accuracy by +6.77% on average and retrieval performance by +4.63% mAP compared to the second-best model. Our results emphasize the relevance of multispectral vision-language learning. The image-caption dataset, code, and model weights are available at https://github.com/IBM/MS-CLIP.
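The zero-shot classification setting mentioned in the abstract can be illustrated with a minimal sketch. It assumes the usual CLIP-style procedure: an image embedding is compared with one text embedding per class prompt by cosine similarity, and the closest prompt wins. The function names, the class names, and the toy 4-dimensional vectors below are hypothetical stand-ins; they are not the MS-CLIP API, and in practice the embeddings would come from the model's image and text encoders.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def zero_shot_classify(image_emb, text_embs):
    """Return (index of best-matching class prompt, list of similarity scores)."""
    scores = [cosine(image_emb, t) for t in text_embs]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores

# Hypothetical class prompts and toy embeddings, for illustration only.
class_names = ["forest", "water", "urban"]
text_embs = [
    [1.0, 0.0, 0.0, 0.0],  # stand-in embedding for a "forest" prompt
    [0.0, 1.0, 0.0, 0.0],  # stand-in embedding for a "water" prompt
    [0.0, 0.0, 1.0, 0.0],  # stand-in embedding for an "urban" prompt
]
image_emb = [0.1, 2.0, 0.0, 0.3]  # stand-in embedding of one multispectral image

best, scores = zero_shot_classify(image_emb, text_embs)
print(class_names[best])
```

No labelled training images are needed for the target classes, which is why the setting is called zero-shot; only the text prompts change from one dataset to the next.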

arXiv information

Authors	Clive Tinashe Marimo, Benedikt Blumenstiel, Maximilian Nitsche, Johannes Jakubik, Thomas Brunschwiler
Published	2025-06-13 11:24:09+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
