Tree of Attributes Prompt Learning for Vision-Language Models

Summary

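Prompt learning has proven effective for adapting vision-language models to downstream tasks.
However, existing methods typically append learnable prompt tokens to the category name alone to obtain textual features, which fails to fully exploit the rich context that the category name implies.
To address this, we propose Tree of Attributes Prompt learning (TAP), which first instructs LLMs to generate a tree of attributes with a "concept – attribute – description" structure for each category, and then learns the hierarchy with vision and text prompt tokens.
Unlike existing methods that merely augment category names with a set of unstructured descriptions, our approach essentially distills structured knowledge graphs associated with class names from LLMs.
Furthermore, our approach introduces text and vision prompts designed to explicitly learn the corresponding visual attributes, so that they effectively serve as domain experts.
In addition, the general and diverse descriptions generated from class names may be wrong for, or absent from, a specific image.
To address this misalignment, we further introduce a vision-conditional pooling module that extracts instance-specific text features.
Extensive experiments show that our approach outperforms state-of-the-art methods on zero-shot base-to-novel generalization, cross-dataset transfer, and few-shot classification across 11 diverse datasets.
Code is available at https://github.com/HHenryD/TAP.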
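
The "concept - attribute - description" hierarchy can be pictured as a small nested data structure. The sketch below is only an illustration of that shape: the class names `ConceptTree` and `AttributeNode`, the `leaves` helper, and the example category, attributes, and descriptions are all hypothetical and are not taken from the paper or its repository.

```python
from dataclasses import dataclass, field


@dataclass
class AttributeNode:
    """One attribute of a concept, holding its generated descriptions."""
    name: str
    descriptions: list[str] = field(default_factory=list)


@dataclass
class ConceptTree:
    """Root of a 'concept - attribute - description' tree for one category."""
    concept: str
    attributes: list[AttributeNode] = field(default_factory=list)

    def leaves(self) -> list[tuple[str, str, str]]:
        """Flatten the tree into (concept, attribute, description) triples."""
        return [
            (self.concept, attr.name, desc)
            for attr in self.attributes
            for desc in attr.descriptions
        ]


# Hypothetical content; in practice the tree would be parsed from an LLM response.
tree = ConceptTree(
    concept="dumplings",
    attributes=[
        AttributeNode("shape", ["crescent-shaped pocket", "pleated edge"]),
        AttributeNode("texture", ["smooth, slightly translucent wrapper"]),
    ],
)
triples = tree.leaves()
```

Keeping descriptions grouped under their attribute, rather than in one flat list, is what distinguishes this structure from an unstructured set of descriptions.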

Summary (original)

Prompt learning has proven effective in adapting vision language models for downstream tasks. However, existing methods usually append learnable prompt tokens solely with the category names to obtain textual features, which fails to fully leverage the rich context indicated in the category name. To address this issue, we propose the Tree of Attributes Prompt learning (TAP), which first instructs LLMs to generate a tree of attributes with a ‘concept – attribute – description’ structure for each category, and then learn the hierarchy with vision and text prompt tokens. Unlike existing methods that merely augment category names with a set of unstructured descriptions, our approach essentially distills structured knowledge graphs associated with class names from LLMs. Furthermore, our approach introduces text and vision prompts designed to explicitly learn the corresponding visual attributes, effectively serving as domain experts. Additionally, the general and diverse descriptions generated based on the class names may be wrong or absent in the specific given images. To address this misalignment, we further introduce a vision-conditional pooling module to extract instance-specific text features. Extensive experimental results demonstrate that our approach outperforms state-of-the-art methods on the zero-shot base-to-novel generalization, cross-dataset transfer, as well as few-shot classification across 11 diverse datasets. Code is available at https://github.com/HHenryD/TAP.
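
The abstract does not spell out how the vision-conditional pooling module works. One plausible reading, sketched here under that assumption, is attention-style pooling: each description embedding is weighted by its similarity to the image feature, so descriptions that do not match the image contribute less. The function name `vision_conditional_pool`, the dot-product scoring, the `temperature` parameter, and the toy vectors are illustrative assumptions, not the authors' implementation.

```python
import math


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def vision_conditional_pool(image_feat, desc_feats, temperature=1.0):
    """Pool description embeddings with image-dependent weights.

    Assumption: weights are a softmax over image-description dot products.
    """
    weights = softmax([dot(image_feat, d) / temperature for d in desc_feats])
    dim = len(desc_feats[0])
    pooled = [
        sum(w * d[i] for w, d in zip(weights, desc_feats)) for i in range(dim)
    ]
    return pooled, weights


# Toy 2-D features: the first description aligns with the image, the second does not.
image_feat = [1.0, 0.0]
desc_feats = [[1.0, 0.0], [0.0, 1.0]]
pooled, weights = vision_conditional_pool(image_feat, desc_feats)
```

With these toy inputs the matching description receives the larger weight, so the pooled text feature leans toward it. Plain average pooling would instead weight both descriptions equally regardless of the image.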

arXiv information

Authors	Tong Ding, Wanhua Li, Zhongqi Miao, Hanspeter Pfister
Published	2025-04-21 15:37:50+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
