In the Era of Prompt Learning with Vision-Language Models

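Summary

Large-scale foundation models such as CLIP show strong zero-shot generalization but struggle under domain shift, which limits their adaptability.
In our work, we introduce StyLIP, a novel domain-agnostic prompt learning strategy for Domain Generalization (DG).
StyLIP disentangles visual style and content in CLIP's vision encoder by using style projectors to learn domain-specific prompt tokens and combining them with content features.
Trained contrastively, this approach enables seamless adaptation across domains and outperforms state-of-the-art methods on multiple DG benchmarks.
Additionally, we propose AD-CLIP for unsupervised domain adaptation (DA), which leverages CLIP's frozen vision backbone to learn domain-invariant prompts through image style and content features.
By aligning domains in the embedding space with entropy minimization, AD-CLIP handles domain shifts effectively, even when only target-domain samples are available.
Finally, we outline future work on class discovery using prompt learning for semantic segmentation in remote sensing, focusing on identifying novel or rare classes in unstructured environments.
This paves the way for more adaptive and generalizable models in complex, real-world scenarios.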

Abstract (original)

Large-scale foundation models like CLIP have shown strong zero-shot generalization but struggle with domain shifts, limiting their adaptability. In our work, we introduce StyLIP, a novel domain-agnostic prompt learning strategy for Domain Generalization (DG). StyLIP disentangles visual style and content in CLIP's vision encoder by using style projectors to learn domain-specific prompt tokens and combining them with content features. Trained contrastively, this approach enables seamless adaptation across domains, outperforming state-of-the-art methods on multiple DG benchmarks. Additionally, we propose AD-CLIP for unsupervised domain adaptation (DA), leveraging CLIP's frozen vision backbone to learn domain-invariant prompts through image style and content features. By aligning domains in embedding space with entropy minimization, AD-CLIP effectively handles domain shifts, even when only target domain samples are available. Lastly, we outline future work on class discovery using prompt learning for semantic segmentation in remote sensing, focusing on identifying novel or rare classes in unstructured environments. This paves the way for more adaptive and generalizable models in complex, real-world scenarios.
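The abstract mentions aligning domains with entropy minimization but gives no formula. As a rough illustration only, the sketch below computes the mean prediction entropy of CLIP-style class probabilities for unlabeled target samples, which is the quantity such an objective would push down. This is not AD-CLIP's actual loss. The function names, the toy two-dimensional features, and the temperature value of 0.01 are all assumptions made for this example.

```python
import math

def softmax(logits):
    # Numerically stable softmax over a list of floats.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cosine(u, v):
    # Cosine similarity between two equal-length vectors.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def class_probabilities(image_feature, text_features, temperature=0.01):
    # CLIP-style scoring: scaled cosine similarity between one image
    # embedding and one text (prompt) embedding per class.
    # The temperature value is an assumption for this sketch.
    logits = [cosine(image_feature, t) / temperature for t in text_features]
    return softmax(logits)

def mean_entropy(batch_probs):
    # Average Shannon entropy of the predicted class distributions.
    # Minimizing this on unlabeled target samples favors confident predictions.
    entropies = [
        -sum(p * math.log(p) for p in probs if p > 0.0)
        for probs in batch_probs
    ]
    return sum(entropies) / len(entropies)

# Hypothetical toy embeddings: two class prompts and two target images.
text_features = [[1.0, 0.0], [0.0, 1.0]]
target_images = [[0.9, 0.1], [0.5, 0.5]]
probs = [class_probabilities(img, text_features) for img in target_images]
loss = mean_entropy(probs)
```

In this toy batch the second image lies exactly between the two class prompts, so its prediction is uniform and contributes the maximum entropy for two classes. A training loop would update the learnable prompt parameters to reduce `loss`.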

arXiv information

Author	Ankit Jha
Published	2024-11-07 17:31:21+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google