Enhancing CLIP with GPT-4: Harnessing Visual Descriptions as Prompts

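Summary

Contrastively pretrained large vision-language models (VLMs) such as CLIP have revolutionized visual representation learning by delivering strong performance on downstream datasets.
VLMs are adapted to a downstream dataset in a 0-shot manner by designing prompts that are relevant to that dataset.
Such prompt engineering relies on domain expertise and a validation dataset.
Meanwhile, recent developments in generative pretrained models such as GPT-4 mean that these models can be used as advanced internet search tools.
They can also be steered to provide visual information in any desired structure.
This work shows that GPT-4 can be used to generate visually descriptive text, and how that text can be used to adapt CLIP to downstream tasks.
Compared with CLIP's default prompt, 0-shot transfer accuracy improves considerably on specialized fine-grained datasets such as EuroSAT (~7%), DTD (~7%), SUN397 (~4.6%), and CUB (~3.3%).
The authors also design a simple few-shot adapter that learns to choose the best possible sentences for constructing generalizable classifiers; it outperforms the recently proposed CoCoOP by ~2% on average and by over 4% on 4 specialized fine-grained datasets.
The code, prompts, and auxiliary text dataset will be released upon acceptance.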

Summary (original)

Contrastive pretrained large Vision-Language Models (VLMs) like CLIP have revolutionized visual representation learning by providing good performance on downstream datasets. VLMs are 0-shot adapted to a downstream dataset by designing prompts that are relevant to the dataset. Such prompt engineering makes use of domain expertise and a validation dataset. Meanwhile, recent developments in generative pretrained models like GPT-4 mean they can be used as advanced internet search tools. They can also be manipulated to provide visual information in any structure. In this work, we show that GPT-4 can be used to generate text that is visually descriptive and how this can be used to adapt CLIP to downstream tasks. We show considerable improvements in 0-shot transfer accuracy on specialized fine-grained datasets like EuroSAT (~7%), DTD (~7%), SUN397 (~4.6%), and CUB (~3.3%) when compared to CLIP’s default prompt. We also design a simple few-shot adapter that learns to choose the best possible sentences to construct generalizable classifiers that outperform the recently proposed CoCoOP by ~2% on average and by over 4% on 4 specialized fine-grained datasets. We will release the code, prompts, and auxiliary text dataset upon acceptance.
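The 0-shot adaptation described in the abstract can be illustrated with a small sketch. This is not the authors' code: it assumes that each class has several descriptive sentences (such as GPT-4 might produce), that their text embeddings are combined into one class prototype, and that an image is assigned to the class with the highest cosine similarity. The 3-dimensional vectors are toy stand-ins for real CLIP embeddings, and the names class_prototype, zero_shot_classify, and the optional weights argument are hypothetical. Non-uniform weights only hint at where a learned sentence selector could plug in; they do not reproduce the paper's few-shot adapter.

```python
import math


def normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def class_prototype(sentence_embeddings, weights=None):
    """Combine the embeddings of one class's descriptive sentences into one vector.

    Uniform weights give a plain prompt ensemble. Non-uniform weights are a
    hypothetical stand-in for a learned sentence selector (an assumption,
    not the paper's adapter).
    """
    n = len(sentence_embeddings)
    if weights is None:
        weights = [1.0 / n] * n
    dim = len(sentence_embeddings[0])
    unit = [normalize(e) for e in sentence_embeddings]
    combined = [sum(w * e[i] for w, e in zip(weights, unit)) for i in range(dim)]
    return normalize(combined)


def zero_shot_classify(image_embedding, prototypes):
    """Return the best-matching class name and the cosine score of every class."""
    img = normalize(image_embedding)
    scores = {
        name: sum(a * b for a, b in zip(img, proto))
        for name, proto in prototypes.items()
    }
    return max(scores, key=scores.get), scores


# Toy vectors standing in for CLIP text embeddings of descriptive sentences.
text_embeddings = {
    "forest": [[0.9, 0.1, 0.0], [0.8, 0.2, 0.1]],
    "river": [[0.1, 0.9, 0.1], [0.0, 0.8, 0.3]],
}
prototypes = {name: class_prototype(embs) for name, embs in text_embeddings.items()}

# Toy vector standing in for a CLIP image embedding.
label, scores = zero_shot_classify([0.7, 0.2, 0.1], prototypes)
print(label, scores)
```

In a real pipeline the toy vectors would be replaced by the outputs of a CLIP text encoder and image encoder; the averaging and cosine-similarity steps stay the same.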

arXiv information

Authors	Mayug Maniparambil, Chris Vorster, Derek Molloy, Noel Murphy, Kevin McGuinness, Noel E. O’Connor
Published	2023-07-21 15:49:59+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
