VLTSeg: Simple Transfer of CLIP-Based Vision-Language Representations for Domain Generalized Semantic Segmentation

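Summary

Domain generalization (DG) remains a significant challenge for perception based on deep neural networks (DNNs). In this work, we propose VLTSeg to enhance domain generalization in semantic segmentation, where the network is trained only on the source domain and evaluated on unseen target domains. The method leverages the inherent semantic robustness of vision-language models. First, by replacing traditional vision-only backbones with pre-trained encoders from CLIP and EVA-CLIP in a transfer learning setting, we find that for DG, vision-language pre-training significantly outperforms supervised and self-supervised vision pre-training. We therefore propose a new vision-language approach for domain generalized segmentation, which improves the domain generalization SOTA by 7.6% mIoU when trained on the synthetic GTA5 dataset. On the popular Cityscapes-to-ACDC benchmark it reaches 76.48% mIoU, demonstrating the superior generalization capabilities of vision-language segmentation models. In addition, the approach reaches 86.1% mIoU on the Cityscapes test set, indicating strong in-domain generalization.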

Abstract (original)

Domain generalization (DG) remains a significant challenge for perception based on deep neural networks (DNN), where domain shifts occur due to lighting, weather, or geolocation changes. In this work, we propose VLTSeg to enhance domain generalization in semantic segmentation, where the network is solely trained on the source domain and evaluated on unseen target domains. Our method leverages the inherent semantic robustness of vision-language models. First, by substituting traditional vision-only backbones with pre-trained encoders from CLIP and EVA-CLIP as transfer learning setting we find that in the field of DG, vision-language pre-training significantly outperforms supervised and self-supervised vision pre-training. We thus propose a new vision-language approach for domain generalized segmentation, which improves the domain generalization SOTA by 7.6% mIoU when training on the synthetic GTA5 dataset. We further show the superior generalization capabilities of vision-language segmentation models by reaching 76.48% mIoU on the popular Cityscapes-to-ACDC benchmark, outperforming the previous SOTA approach by 6.9% mIoU on the test set at the time of writing. Additionally, our approach shows strong in-domain generalization capabilities indicated by 86.1% mIoU on the Cityscapes test set, resulting in a shared first place with the previous SOTA on the current leaderboard at the time of submission.
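All results in the abstract are reported as mIoU (mean intersection over union). As a reading aid, here is a minimal sketch of how that metric is commonly computed for a single prediction. It is not the authors' evaluation code: the function name `mean_iou`, the flattened label lists, and the ignore label 255 (a convention often used for void pixels in Cityscapes-style annotations) are assumptions made for illustration.

```python
from typing import List, Optional


def mean_iou(
    pred: List[int],
    target: List[int],
    num_classes: int,
    ignore_index: int = 255,  # assumed void label, not taken from the paper
) -> Optional[float]:
    """Mean IoU over all classes that occur in the prediction or the target."""
    intersection = [0] * num_classes
    pred_count = [0] * num_classes
    target_count = [0] * num_classes

    for p, t in zip(pred, target):
        if t == ignore_index:
            continue  # void pixels do not contribute to any class
        pred_count[p] += 1
        target_count[t] += 1
        if p == t:
            intersection[p] += 1

    ious = []
    for c in range(num_classes):
        union = pred_count[c] + target_count[c] - intersection[c]
        if union > 0:  # skip classes absent from both prediction and target
            ious.append(intersection[c] / union)

    return sum(ious) / len(ious) if ious else None


if __name__ == "__main__":
    # Four pixels, two classes: one class-1 pixel is mislabelled as class 0.
    print(mean_iou([0, 0, 1, 1], [0, 1, 1, 1], num_classes=2))
```

Benchmark scores are usually obtained by accumulating the per-class intersections and unions over the whole evaluation set before averaging, rather than averaging per-image values as this single-call sketch would suggest.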

arXiv information

Authors	Christoph Hümmer, Manuel Schwonberg, Liangwei Zhong, Hu Cao, Alois Knoll, Hanno Gottschalk
Published	2023-12-04 16:46:38+00:00
arXiv site	arxiv_id(pdf)

Sources and services used

arxiv.jp, DeepL
