A Vision-Language Framework for Multispectral Scene Representation Using Language-Grounded Features

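Abstract

Scene understanding in remote sensing often struggles to produce accurate representations of complex environments, such as diverse land-use areas and coastal regions, which may also contain snow, clouds, or haze.
To address this, we introduce Spectral LLaVA, a vision-language framework that integrates multispectral data with vision-language alignment techniques to enhance scene representation and description.
Using the Sentinel-2 BigEarthNet v2 dataset, we establish a baseline with RGB-based scene descriptions and then demonstrate substantial improvements from incorporating multispectral information.
Our framework optimizes a lightweight linear projection layer for alignment while keeping the SpectralGPT vision backbone frozen.
Our experiments cover scene classification using linear probing, and language modeling for jointly performing scene classification and description generation.
Our results highlight Spectral LLaVA's ability to produce detailed and accurate descriptions, particularly in scenarios where RGB data alone proves inadequate, while also improving classification performance by refining SpectralGPT features into semantically meaningful representations.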

Abstract (original)

Scene understanding in remote sensing often faces challenges in generating accurate representations for complex environments such as various land use areas or coastal regions, which may also include snow, clouds, or haze. To address this, we present a vision-language framework named Spectral LLaVA, which integrates multispectral data with vision-language alignment techniques to enhance scene representation and description. Using the BigEarthNet v2 dataset from Sentinel-2, we establish a baseline with RGB-based scene descriptions and further demonstrate substantial improvements through the incorporation of multispectral information. Our framework optimizes a lightweight linear projection layer for alignment while keeping the vision backbone of SpectralGPT frozen. Our experiments encompass scene classification using linear probing and language modeling for jointly performing scene classification and description generation. Our results highlight Spectral LLaVA’s ability to produce detailed and accurate descriptions, particularly for scenarios where RGB data alone proves inadequate, while also enhancing classification performance by refining SpectralGPT features into semantically meaningful representations.
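
The abstract describes training only a lightweight linear projection while the vision backbone stays frozen. The following is a minimal sketch of that general pattern, not the authors' implementation: the backbone is a hypothetical fixed random layer standing in for SpectralGPT, the targets are synthetic vectors standing in for language embeddings, and a mean-squared-error objective replaces the language-modeling loss the paper would use. All names and sizes (NUM_BANDS, FEAT_DIM, EMBED_DIM, train_projection) are assumptions made for illustration.

```python
import math
import random

random.seed(0)

NUM_BANDS = 12  # assumption: spectral bands per input sample
FEAT_DIM = 8    # assumption: size of the backbone feature vector
EMBED_DIM = 4   # assumption: size of the language embedding space


def rand_matrix(rows, cols, scale=0.5):
    return [[random.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]


def matvec(matrix, vec):
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]


# Hypothetical frozen stand-in for the vision backbone: weights are never updated.
BACKBONE_W = rand_matrix(FEAT_DIM, NUM_BANDS)


def backbone(pixels):
    return [math.tanh(v) for v in matvec(BACKBONE_W, pixels)]


def mse(pred, target):
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def train_projection(samples, targets, epochs=200, lr=0.1):
    """Fit only the linear projection with per-sample gradient descent."""
    proj = rand_matrix(EMBED_DIM, FEAT_DIM, scale=0.1)
    # The backbone is frozen, so its features can be computed once and reused.
    feats = [backbone(x) for x in samples]
    losses = []
    for _ in range(epochs):
        total = 0.0
        for f, t in zip(feats, targets):
            pred = matvec(proj, f)
            total += mse(pred, t)
            # d(MSE)/d(proj[i][j]) = 2 * (pred[i] - t[i]) * f[j] / EMBED_DIM
            for i in range(EMBED_DIM):
                g = 2.0 * (pred[i] - t[i]) / EMBED_DIM
                for j in range(FEAT_DIM):
                    proj[i][j] -= lr * g * f[j]
        losses.append(total / len(feats))
    return proj, losses


# Synthetic data: targets come from a hidden linear map of the frozen features.
samples = [[random.random() for _ in range(NUM_BANDS)] for _ in range(16)]
hidden_map = rand_matrix(EMBED_DIM, FEAT_DIM)
targets = [matvec(hidden_map, backbone(x)) for x in samples]

backbone_before = [row[:] for row in BACKBONE_W]
proj, losses = train_projection(samples, targets)
print(f"first-epoch loss {losses[0]:.4f}, last-epoch loss {losses[-1]:.6f}")
```

Because gradients are applied only to proj, the backbone weights are identical before and after training. In an actual framework the same effect would typically come from disabling gradient tracking on the backbone parameters.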

arXiv information

Authors	Enes Karanfil, Nevrez Imamoglu, Erkut Erdem, Aykut Erdem
Published	2025-01-17 12:12:33+00:00
arXiv page	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
