Towards Vision-Language Geo-Foundation Model: A Survey

Abstract

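Vision-Language Foundation Models (VLFMs) have made remarkable progress on a range of multimodal tasks, including image captioning, image-text retrieval, visual question answering, and visual grounding.
However, most methods rely on training with general-purpose image datasets, and the lack of geospatial data leads to poor performance on Earth observation.
Recently, numerous geospatial image-text pair datasets, and VLFMs fine-tuned on them, have been proposed.
These new approaches aim to leverage large-scale, multimodal geospatial data to build versatile intelligent models with diverse geo-perceptive capabilities, which we refer to as Vision-Language Geo-Foundation Models (VLGFMs).
This paper thoroughly reviews VLGFMs, summarizing and analyzing recent developments in the field.
In particular, we introduce the background and motivation behind the rise of VLGFMs and highlight their unique research significance.
We then systematically summarize the core technologies employed in VLGFMs, including data construction, model architectures, and applications to various multimodal geospatial tasks.
Finally, we conclude with insights, open issues, and a discussion of future research directions.
To the best of our knowledge, this is the first comprehensive literature review of VLGFMs.
We continue to track related work at https://github.com/zytx121/Awesome-VLGFM.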

Abstract (original)

Vision-Language Foundation Models (VLFMs) have made remarkable progress on various multimodal tasks, such as image captioning, image-text retrieval, visual question answering, and visual grounding. However, most methods rely on training with general image datasets, and the lack of geospatial data leads to poor performance on earth observation. Numerous geospatial image-text pair datasets and VLFMs fine-tuned on them have been proposed recently. These new approaches aim to leverage large-scale, multimodal geospatial data to build versatile intelligent models with diverse geo-perceptive capabilities, which we refer to as Vision-Language Geo-Foundation Models (VLGFMs). This paper thoroughly reviews VLGFMs, summarizing and analyzing recent developments in the field. In particular, we introduce the background and motivation behind the rise of VLGFMs, highlighting their unique research significance. Then, we systematically summarize the core technologies employed in VLGFMs, including data construction, model architectures, and applications of various multimodal geospatial tasks. Finally, we conclude with insights, issues, and discussions regarding future research directions. To the best of our knowledge, this is the first comprehensive literature review of VLGFMs. We keep tracing related works at https://github.com/zytx121/Awesome-VLGFM.

arXiv information

Authors	Yue Zhou, Litong Feng, Yiping Ke, Xue Jiang, Junchi Yan, Xue Yang, Wayne Zhang
Published	2024-06-13 17:57:30+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
