Language Models as Black-Box Optimizers for Vision-Language Models

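Summary

Vision-language models (VLMs) pre-trained on web-scale datasets have shown strong capabilities across a wide range of vision and multimodal tasks.
Current fine-tuning methods for VLMs mostly operate in a white-box setting and require access to model parameters for backpropagation.
However, many VLMs rely on proprietary data and are not open-source, which limits the use of white-box fine-tuning.
Because popular private large language models (LLMs) such as ChatGPT still expose a language-based user interface, we aim to develop a new fine-tuning approach for VLMs that works through natural language prompts, with no need to access model parameters, feature embeddings, or output logits.
In this setup, we propose using chat-based LLMs as black-box optimizers that search for the best text prompt, with few-shot image classification using CLIP as the illustrative task.
Specifically, we adopt an automatic "hill-climbing" procedure: we evaluate the accuracy of the current prompts and ask the LLM to refine them based on textual feedback, so the search converges on an effective prompt entirely within a conversation and without a human in the loop.
In a challenging 1-shot learning setup, this simple approach surpasses the white-box continuous prompting method CoOp by an average of 1.5% across 11 datasets, including ImageNet.
It also outperforms OpenAI's manually crafted prompts and is more efficient than other black-box methods such as iterative APE.
In addition, we highlight the benefit of conversational feedback that includes both positive and negative prompts, which suggests that LLMs can exploit the implicit "gradient" direction in textual feedback to search more efficiently.
Finally, we find that the text prompts produced by our strategy are not only more interpretable but also transfer well across different CLIP architectures in a black-box manner.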
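
The hill-climbing loop described in the summary can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: `hill_climb_prompts`, `toy_score`, and `make_toy_proposer` are hypothetical names introduced here. `toy_score` is a stand-in for measuring CLIP few-shot accuracy with a given prompt template, and the toy proposer is a stand-in for a chat-based LLM that receives the best (positive) and worst (negative) prompts as feedback.

```python
import random
from typing import Callable, List, Tuple

Scored = Tuple[str, float]


def hill_climb_prompts(
    score_fn: Callable[[str], float],
    propose_fn: Callable[[List[Scored], List[Scored]], List[str]],
    initial_prompts: List[str],
    n_iters: int = 10,
    k: int = 2,
) -> Scored:
    """Keep a scored pool of prompts and refine it round by round.

    Each round, the proposer sees the top-k (positive) and bottom-k
    (negative) prompts with their scores and returns new candidates.
    """
    scored = {p: score_fn(p) for p in initial_prompts}
    for _ in range(n_iters):
        ranked = sorted(scored.items(), key=lambda kv: kv[1], reverse=True)
        positives, negatives = ranked[:k], ranked[-k:]
        for candidate in propose_fn(positives, negatives):
            if candidate not in scored:  # score each prompt only once
                scored[candidate] = score_fn(candidate)
    # The pool never discards prompts, so the best score cannot decrease.
    return max(scored.items(), key=lambda kv: kv[1])


# Assumption: a toy objective standing in for CLIP few-shot accuracy.
TARGET_WORDS = {"a", "photo", "of"}


def toy_score(prompt: str) -> float:
    return len(set(prompt.split()) & TARGET_WORDS) / len(TARGET_WORDS)


# Assumption: a toy proposer standing in for a chat-based LLM call.
def make_toy_proposer(seed: int = 0):
    rng = random.Random(seed)
    vocab = ["a", "photo", "of", "blurry", "sketch", "the", "clear"]

    def propose(positives: List[Scored], negatives: List[Scored]) -> List[str]:
        base = positives[0][0]
        # Avoid words that only appear in low-scoring prompts.
        avoid = {w for p, _ in negatives for w in p.split()} - set(base.split())
        choices = [w for w in vocab if w not in avoid] or vocab
        return [f"{rng.choice(choices)} {base}" for _ in range(3)]

    return propose


if __name__ == "__main__":
    best, score = hill_climb_prompts(
        toy_score, make_toy_proposer(0), ["{}", "sketch {}"], n_iters=5
    )
    print(best, score)
```

In a real setting, `score_fn` would evaluate a prompt template on the few-shot training images, and `propose_fn` would format the positive and negative prompts into a chat message and parse the LLM's reply; both are left abstract here because the summary does not specify them.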

Summary (original)

Vision-language models (VLMs) pre-trained on web-scale datasets have demonstrated remarkable capabilities across a variety of vision and multimodal tasks. Currently, fine-tuning methods for VLMs mainly operate in a white-box setting, requiring access to model parameters for backpropagation. However, many VLMs rely on proprietary data and are not open-source, which restricts the use of white-box approaches for fine-tuning. Given that popular private large language models (LLMs) like ChatGPT still offer a language-based user interface, we aim to develop a novel fine-tuning approach for VLMs through natural language prompts, thereby avoiding the need to access model parameters, feature embeddings, or output logits. In this setup, we propose employing chat-based LLMs as black-box optimizers to search for the best text prompt on the illustrative task of few-shot image classification using CLIP. Specifically, we adopt an automatic ‘hill-climbing’ procedure that converges on an effective prompt by evaluating the accuracy of current prompts and asking LLMs to refine them based on textual feedback, all within a conversational process without human-in-the-loop. In a challenging 1-shot learning setup, our simple approach surpasses the white-box continuous prompting method CoOp by an average of 1.5% across 11 datasets including ImageNet. Our approach also outperforms OpenAI’s manually crafted prompts and is more efficient than other black-box methods like iterative APE. Additionally, we highlight the advantage of conversational feedback incorporating both positive and negative prompts, suggesting that LLMs can utilize the implicit ‘gradient’ direction in textual feedback for a more efficient search. Lastly, we find that the text prompts generated through our strategy are not only more interpretable but also transfer well across different CLIP architectures in a black-box manner.

arXiv information

Authors	Samuel Yu, Shihong Liu, Zhiqiu Lin, Deepak Pathak, Deva Ramanan
Published	2023-09-12 04:03:41+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
