Conceptual Codebook Learning for Vision-Language Models

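Summary

This paper proposes Conceptual Codebook Learning (CoCoLe), a new fine-tuning method for vision-language models (VLMs). It addresses the challenge of improving a VLM's generalization ability while fine-tuning it on downstream tasks in a few-shot setting.
We observe that visual concepts such as texture, shape, and color transfer naturally across domains and play an important role in generalization tasks.
Motivated by this finding, we learn a conceptual codebook made up of visual concepts as keys and conceptual prompts as values. The codebook serves as a link between the image encoder's outputs and the text encoder's inputs.
Specifically, for a given image, we use the codebook to identify the most relevant conceptual prompts associated with the class embeddings, and then perform classification.
In addition, we incorporate a handcrafted concept cache as a regularizer to mitigate overfitting in low-shot scenarios.
We find that this conceptual codebook learning method strengthens the alignment between the visual and linguistic modalities.
Extensive experiments show that CoCoLe markedly outperforms existing state-of-the-art methods across a range of evaluation settings, including base-to-new generalization, cross-dataset evaluation, and domain generalization.
Detailed ablation studies further confirm the effectiveness of each component of CoCoLe.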
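
The key-value lookup described in the summary can be illustrated with a minimal sketch. This is not the authors' implementation: the toy codebook, the 3-dimensional vectors, the string stand-ins for learnable prompts, the use of cosine similarity, and the names `cosine`, `lookup_prompts`, and `top_k` are all assumptions made for illustration. The abstract does not specify how keys are matched or how the selected prompts are combined with class embeddings, so the sketch stops at retrieval.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def lookup_prompts(image_feature, codebook, top_k=2):
    """Return the values whose keys are most similar to the image feature.

    Assumption: similarity is cosine similarity and the top_k best keys are kept.
    """
    ranked = sorted(
        codebook,
        key=lambda pair: cosine(image_feature, pair[0]),
        reverse=True,
    )
    return [value for _, value in ranked[:top_k]]


# Hypothetical toy codebook: (visual-concept key, conceptual-prompt value) pairs.
# In a real model both would be learned vectors; strings stand in for prompts here.
codebook = [
    ([1.0, 0.0, 0.0], "prompt_texture"),
    ([0.0, 1.0, 0.0], "prompt_shape"),
    ([0.0, 0.0, 1.0], "prompt_color"),
]

# Hypothetical image-encoder output.
image_feature = [0.9, 0.1, 0.4]
selected = lookup_prompts(image_feature, codebook, top_k=2)
print(selected)
```

In this toy setup the image feature is closest to the texture key and then the color key, so those two prompts are the ones retrieved.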

Abstract (original)

In this paper, we propose Conceptual Codebook Learning (CoCoLe), a novel fine-tuning method for vision-language models (VLMs) to address the challenge of improving the generalization capability of VLMs while fine-tuning them on downstream tasks in a few-shot setting. We recognize that visual concepts, such as textures, shapes, and colors are naturally transferable across domains and play a crucial role in generalization tasks. Motivated by this interesting finding, we learn a conceptual codebook consisting of visual concepts as keys and conceptual prompts as values, which serves as a link between the image encoder’s outputs and the text encoder’s inputs. Specifically, for a given image, we leverage the codebook to identify the most relevant conceptual prompts associated with the class embeddings to perform the classification. Additionally, we incorporate a handcrafted concept cache as a regularization to alleviate the overfitting issues in low-shot scenarios. We observe that this conceptual codebook learning method is able to achieve enhanced alignment between visual and linguistic modalities. Extensive experimental results demonstrate that our CoCoLe method remarkably outperforms the existing state-of-the-art methods across various evaluation settings, including base-to-new generalization, cross-dataset evaluation, and domain generalization tasks. Detailed ablation studies further confirm the efficacy of each component in CoCoLe.

arXiv information

Authors	Yi Zhang, Ke Yu, Siqi Wu, Zhihai He
Published	2024-07-02 15:16:06+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
