Show or Tell? Effectively prompting Vision-Language Models for semantic segmentation

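Summary

Large Vision-Language Models (VLMs) are increasingly regarded as foundation models that can be instructed to solve diverse tasks through prompting alone, without task-specific training.
We examine a seemingly obvious question: how to prompt VLMs effectively for semantic segmentation.
To that end, we systematically evaluate the segmentation performance of several recent models, guided by either text or visual prompts, on the out-of-distribution MESS dataset collection.
We introduce a scalable prompting scheme, few-shot prompted semantic segmentation, inspired by open-vocabulary segmentation and few-shot learning.
VLMs turn out to lag far behind specialist models trained for a specific segmentation task, by about 30% on average on the Intersection-over-Union metric.
Moreover, text prompts and visual prompts are complementary: each of the two modes fails on many examples that the other can solve.
Our analysis suggests that being able to anticipate the most effective prompt modality can lead to an 11% improvement in performance.
Motivated by these findings, we propose PromptMatcher, a remarkably simple training-free baseline that combines text and visual prompts. It achieves state-of-the-art results on few-shot prompted semantic segmentation, outperforming the best text-prompted VLM by 2.5% and the top visual-prompted VLM by 3.5%.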

Summary (original)

Large Vision-Language Models (VLMs) are increasingly being regarded as foundation models that can be instructed to solve diverse tasks by prompting, without task-specific training. We examine the seemingly obvious question: how to effectively prompt VLMs for semantic segmentation. To that end, we systematically evaluate the segmentation performance of several recent models guided by either text or visual prompts on the out-of-distribution MESS dataset collection. We introduce a scalable prompting scheme, few-shot prompted semantic segmentation, inspired by open-vocabulary segmentation and few-shot learning. It turns out that VLMs lag far behind specialist models trained for a specific segmentation task, by about 30% on average on the Intersection-over-Union metric. Moreover, we find that text prompts and visual prompts are complementary: each one of the two modes fails on many examples that the other one can solve. Our analysis suggests that being able to anticipate the most effective prompt modality can lead to an 11% improvement in performance. Motivated by our findings, we propose PromptMatcher, a remarkably simple training-free baseline that combines both text and visual prompts, achieving state-of-the-art results, outperforming the best text-prompted VLM by 2.5%, and the top visual-prompted VLM by 3.5% on few-shot prompted semantic segmentation.
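
The abstract reports results on the Intersection-over-Union (IoU) metric and argues that choosing the better prompt modality per example would raise performance. A minimal sketch of both ideas follows, in plain Python on toy binary masks. The function names (`iou`, `compare_modalities`), the empty-mask convention, and the per-example "oracle" that simply takes the higher of the two scores are assumptions made for illustration; they are not taken from the paper and do not reproduce its evaluation protocol or its numbers.

```python
from typing import Sequence, Tuple

Mask = Sequence[Sequence[int]]  # binary mask: 1 = foreground, 0 = background


def iou(pred: Mask, target: Mask) -> float:
    """Intersection-over-Union of two binary masks of equal shape."""
    intersection = 0
    union = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            if p and t:
                intersection += 1
            if p or t:
                union += 1
    # Assumed convention: two empty masks count as a perfect match.
    return intersection / union if union else 1.0


def mean(values: Sequence[float]) -> float:
    return sum(values) / len(values)


def compare_modalities(
    text_ious: Sequence[float], visual_ious: Sequence[float]
) -> Tuple[float, float, float]:
    """Return mean IoU for text prompts, visual prompts, and a
    hypothetical oracle that picks the better modality per example."""
    oracle = [max(t, v) for t, v in zip(text_ious, visual_ious)]
    return mean(text_ious), mean(visual_ious), mean(oracle)


if __name__ == "__main__":
    # Toy 2x2 example: each prediction overlaps the target differently.
    target = [[1, 1], [0, 0]]
    text_pred = [[1, 0], [0, 0]]    # misses one foreground pixel
    visual_pred = [[1, 1], [1, 1]]  # over-segments the background
    print(iou(text_pred, target), iou(visual_pred, target))

    # Made-up per-example scores where the modalities fail on different examples.
    print(compare_modalities([0.75, 0.25], [0.25, 0.5]))
```

When the two modalities fail on different examples, the oracle mean exceeds both individual means, which is the sense in which the abstract calls the prompts complementary.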

arXiv information

Authors	Niccolo Avogaro,Thomas Frick,Mattia Rigotti,Andrea Bartezzaghi,Filip Janicki,Cristiano Malossi,Konrad Schindler,Roy Assaf
Published	2025-03-25 13:36:59+00:00
arXiv page	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google
