GeoLLaVA-8K: Scaling Remote-Sensing Multimodal Large Language Models to 8K Resolution

Abstract

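Ultra-high-resolution (UHR) remote sensing (RS) imagery provides valuable data for Earth observation, but it poses challenges for existing multimodal foundation models because of two key bottlenecks: (1) limited availability of UHR training data and (2) token explosion caused by the large image size.
To address data scarcity, we introduce SuperRS-VQA (avg. 8,376$\times$8,376) and HighRS-VQA (avg. 2,000$\times$1,912).
To mitigate token explosion, our pilot studies reveal significant redundancy in RS images: the crucial information is concentrated in a small subset of object-centric tokens, and pruning background tokens (e.g., ocean or forest) can even improve performance.
Motivated by these findings, we propose two strategies, Background Token Pruning and Anchored Token Selection, which reduce the memory footprint while preserving key semantics. Integrating these techniques, we introduce GeoLLaVA-8K, the first RS-focused multimodal large language model capable of handling inputs up to 8K$\times$8K resolution.
Trained on SuperRS-VQA and HighRS-VQA, GeoLLaVA-8K sets a new state of the art on XLRS-Bench.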
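
To make the token explosion concrete, the sketch below counts the patch tokens a ViT-style vision encoder would produce for the two average image sizes quoted above. The patch size of 14 pixels and the function name `vision_token_count` are illustrative assumptions; the abstract does not state which encoder or patch size GeoLLaVA-8K uses.

```python
import math


def vision_token_count(height: int, width: int, patch_size: int = 14) -> int:
    """Count patch tokens for one image, before any pooling or pruning.

    patch_size=14 is an assumed value, not taken from the paper.
    Partial patches at the border are counted as full patches (padding).
    """
    rows = math.ceil(height / patch_size)
    cols = math.ceil(width / patch_size)
    return rows * cols


if __name__ == "__main__":
    # Average resolutions quoted in the abstract.
    sizes = {
        "HighRS-VQA (avg.)": (2000, 1912),
        "SuperRS-VQA (avg.)": (8376, 8376),
    }
    for name, (h, w) in sizes.items():
        print(f"{name}: {vision_token_count(h, w):,} tokens")
```

Under this assumption, the token count grows with the product of the two image dimensions, so an image with roughly four times the side length yields roughly sixteen times as many tokens.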

Abstract (original)

Ultra-high-resolution (UHR) remote sensing (RS) imagery offers valuable data for Earth observation but poses challenges for existing multimodal foundation models due to two key bottlenecks: (1) limited availability of UHR training data, and (2) token explosion caused by the large image size. To address data scarcity, we introduce SuperRS-VQA (avg. 8,376$\times$8,376) and HighRS-VQA (avg. 2,000$\times$1,912), the highest-resolution vision-language datasets in RS to date, covering 22 real-world dialogue tasks. To mitigate token explosion, our pilot studies reveal significant redundancy in RS images: crucial information is concentrated in a small subset of object-centric tokens, while pruning background tokens (e.g., ocean or forest) can even improve performance. Motivated by these findings, we propose two strategies: Background Token Pruning and Anchored Token Selection, to reduce the memory footprint while preserving key semantics. Integrating these techniques, we introduce GeoLLaVA-8K, the first RS-focused multimodal large language model capable of handling inputs up to 8K$\times$8K resolution, built on the LLaVA framework. Trained on SuperRS-VQA and HighRS-VQA, GeoLLaVA-8K sets a new state-of-the-art on the XLRS-Bench.
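
The idea that a small subset of tokens carries most of the useful information can be illustrated with a generic top-k selection over per-token importance scores. This is only a sketch of the general token-pruning pattern, not the paper's Background Token Pruning or Anchored Token Selection, whose details the abstract does not give. The function name `keep_top_k_indices` and the example scores are hypothetical.

```python
from typing import List, Sequence


def keep_top_k_indices(scores: Sequence[float], k: int) -> List[int]:
    """Return the indices of the k highest-scoring tokens, in original order.

    `scores` is a hypothetical per-token importance score (higher means
    more likely to be object-centric); how such a score is obtained is
    outside the scope of this sketch.
    """
    k = max(0, min(k, len(scores)))
    # Rank token indices by descending score; Python's sort is stable,
    # so tied tokens keep their original relative order.
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    # Restore spatial order so the kept tokens stay in sequence.
    return sorted(ranked[:k])


if __name__ == "__main__":
    # Six tokens; low scores stand in for background such as ocean or forest.
    example_scores = [0.05, 0.90, 0.10, 0.70, 0.02, 0.60]
    kept = keep_top_k_indices(example_scores, k=3)
    print(kept)
```

Keeping the surviving indices in their original order matters because the language model still relies on token position to relate image regions to one another.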

arXiv information

Authors	Fengxiang Wang, Mingshuo Chen, Yueying Li, Di Wang, Haotian Wang, Zonghao Guo, Zefan Wang, Boqi Shan, Long Lan, Yulin Wang, Hongzhen Wang, Wenjing Yang, Bo Du, Jing Zhang
Published	2025-05-27 16:05:03+00:00
arXiv page	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
