RSTeller: Scaling Up Visual Language Modeling in Remote Sensing with Rich Linguistic Semantics from Openly Available Data and Large Language Models

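Summary

Abundant, well-annotated multimodal data are pivotal in remote sensing for aligning complex visual remote sensing (RS) scenes with human language, and they enable the development of specialized vision-language models across diverse RS interpretation tasks.
However, annotating RS images with rich linguistic semantics at scale requires RS expertise and substantial human labor, which makes it costly and often impractical.
In this study, we propose a workflow that uses large language models (LLMs) to generate multimodal datasets with semantically rich captions at scale, starting from plain OpenStreetMap (OSM) data for images sourced from the Google Earth Engine (GEE) platform.
This approach makes paired remote sensing data easy to generate and can be readily scaled up using openly available data.
Within this framework, we present RSTeller, a multimodal dataset of more than 1.3 million RS images, each accompanied by two descriptive captions.
Extensive experiments show that RSTeller improves the performance of multiple existing vision-language models on RS scene understanding through continual pre-training.
Our methodology significantly reduces the manual effort and expertise needed to annotate remote sensing imagery while democratizing access to high-quality annotated data.
This advance fosters progress in visual language modeling and encourages broader participation in remote sensing research and applications.
The RSTeller dataset is available at https://github.com/SlytherinGe/RSTeller.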
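
The summary describes the workflow only at a high level: plain OSM data goes in, and an LLM produces captions for the matching image. As a rough illustration of the first step, the sketch below turns a set of OSM key/value tags into a text prompt. The function name `build_caption_prompt`, the prompt wording, and the example tags are all hypothetical and are not taken from the paper; the default of two captions simply mirrors the two captions per image mentioned above.

```python
def build_caption_prompt(osm_tags: dict[str, str], n_captions: int = 2) -> str:
    """Format raw OSM tags as a prompt asking an LLM for image captions.

    Hypothetical helper for illustration only; the paper's actual prompts
    and preprocessing are not described in this summary.
    """
    if not osm_tags:
        raise ValueError("osm_tags must contain at least one tag")

    # List tags in a stable (sorted) order so the prompt is reproducible.
    tag_lines = "\n".join(f"- {key}: {value}" for key, value in sorted(osm_tags.items()))

    return (
        "The following OpenStreetMap tags describe a feature visible "
        "in a remote sensing image:\n"
        f"{tag_lines}\n"
        f"Write {n_captions} distinct, fluent captions describing the scene. "
        "Mention only what the tags support."
    )


# Example tags are made up for demonstration.
prompt = build_caption_prompt({"landuse": "farmland", "crop": "rice"})
print(prompt)
```

The resulting string would then be sent to whichever LLM is in use; that call is left out here because the summary does not say which model or API the authors rely on.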

Summary (original)

Abundant, well-annotated multimodal data in remote sensing are pivotal for aligning complex visual remote sensing (RS) scenes with human language, enabling the development of specialized vision language models across diverse RS interpretation tasks. However, annotating RS images with rich linguistic semantics at scale demands expertise in RS and substantial human labor, making it costly and often impractical. In this study, we propose a workflow that leverages large language models (LLMs) to generate multimodal datasets with semantically rich captions at scale from plain OpenStreetMap (OSM) data for images sourced from the Google Earth Engine (GEE) platform. This approach facilitates the generation of paired remote sensing data and can be readily scaled up using openly available data. Within this framework, we present RSTeller, a multimodal dataset comprising over 1.3 million RS images, each accompanied by two descriptive captions. Extensive experiments demonstrate that RSTeller enhances the performance of multiple existing vision language models for RS scene understanding through continual pre-training. Our methodology significantly reduces the manual effort and expertise needed for annotating remote sensing imagery while democratizing access to high-quality annotated data. This advancement fosters progress in visual language modeling and encourages broader participation in remote sensing research and applications. The RSTeller dataset is available at https://github.com/SlytherinGe/RSTeller.

arXiv information

Authors	Junyao Ge, Xu Zhang, Yang Zheng, Kaitai Guo, Jimin Liang
Published	2025-04-16 13:02:25+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
