ZIP: An Efficient Zeroth-order Prompt Tuning for Black-box Vision-Language Models

Summary

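Recent work has introduced a range of approaches for prompt-tuning black-box vision-language models, collectively called black-box prompt-tuning (BBPT).
BBPT has shown considerable potential, but many existing methods require an excessive number of queries (that is, function evaluations).
To address this, the authors propose Zeroth-order Intrinsic-dimensional Prompt-tuning (ZIP), a new approach that enables efficient and robust prompt optimization in a purely black-box setting.
The key idea of ZIP is to reduce both the dimensionality of the problem and the variance of the zeroth-order gradient estimates.
It does so by re-parameterizing prompts in low-rank representations and by applying intrinsic-dimensional clipping to the estimated gradients.
Evaluated on 13+ vision-language tasks from standard benchmarks, ZIP achieves average improvements of approximately 6% in few-shot accuracy and 48% in query efficiency over the best-performing alternative BBPT methods, establishing a new state of the art.
An ablation analysis further shows that the proposed clipping mechanism is robust and nearly optimal: without any manual selection of the clipping threshold, it matches the result of an expensive hyperparameter search.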
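
The three ingredients named in the summary (a low-rank prompt parameterization, a zeroth-order gradient estimate, and gradient clipping) can be illustrated with a small NumPy sketch. This is not the authors' implementation. The function names, the two-point Gaussian estimator, the toy loss, and the choice of sqrt(d) as the clipping threshold (d being the number of trainable parameters) are all assumptions made here for illustration; the abstract does not specify any of them.

```python
import numpy as np


def unpack(theta, n_tokens, embed_dim, rank):
    """Split the flat trainable vector into two low-rank factors U and V."""
    split = n_tokens * rank
    U = theta[:split].reshape(n_tokens, rank)
    V = theta[split:].reshape(rank, embed_dim)
    return U, V


def zo_gradient(loss_fn, theta, rng, mu=1e-3, num_samples=8):
    """Two-point zeroth-order gradient estimate along random Gaussian directions.

    Only loss values are used (2 * num_samples queries), never true gradients.
    """
    grad = np.zeros_like(theta)
    for _ in range(num_samples):
        z = rng.standard_normal(theta.shape)
        delta = loss_fn(theta + mu * z) - loss_fn(theta - mu * z)
        grad += (delta / (2.0 * mu)) * z
    return grad / num_samples


def clip_by_norm(grad, threshold):
    """Rescale grad so that its L2 norm does not exceed threshold."""
    norm = np.linalg.norm(grad)
    if norm > threshold:
        return grad * (threshold / norm)
    return grad


def zo_step(loss_fn, theta, rng, lr=0.05):
    """One update: estimate, clip, descend.

    Assumption: the threshold is sqrt(d), with d the number of trainable
    parameters. This is an illustrative choice, not taken from the source.
    """
    threshold = np.sqrt(theta.size)
    grad = clip_by_norm(zo_gradient(loss_fn, theta, rng), threshold)
    return theta - lr * grad


def toy_problem(n_tokens=4, embed_dim=16, rank=2, seed=0):
    """Hypothetical stand-in for a black-box model: only loss values are exposed."""
    rng = np.random.default_rng(seed)
    target = rng.standard_normal((n_tokens, embed_dim))

    def loss_fn(theta):
        U, V = unpack(theta, n_tokens, embed_dim, rank)
        # The full prompt is U @ V; the "model" just scores its distance to a target.
        return float(np.mean((U @ V - target) ** 2))

    theta = 0.1 * rng.standard_normal(n_tokens * rank + rank * embed_dim)
    return theta, loss_fn
```

With these toy sizes the optimizer searches over 4*2 + 2*16 = 40 parameters instead of the 4*16 = 64 entries of the full prompt matrix, which is the sense in which a low-rank parameterization reduces problem dimensionality. Each call to `zo_step` costs `2 * num_samples` loss queries, so the query budget is set by the number of steps and samples per step.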

Abstract (original)

Recent studies have introduced various approaches for prompt-tuning black-box vision-language models, referred to as black-box prompt-tuning (BBPT). While BBPT has demonstrated considerable potential, it is often found that many existing methods require an excessive number of queries (i.e., function evaluations), which poses a significant challenge in real-world scenarios where the number of allowed queries is limited. To tackle this issue, we propose Zeroth-order Intrinsic-dimensional Prompt-tuning (ZIP), a novel approach that enables efficient and robust prompt optimization in a purely black-box setting. The key idea of ZIP is to reduce the problem dimensionality and the variance of zeroth-order gradient estimates, such that the training is done fast with far less queries. We achieve this by re-parameterizing prompts in low-rank representations and designing intrinsic-dimensional clipping of estimated gradients. We evaluate ZIP on 13+ vision-language tasks in standard benchmarks and show that it achieves an average improvement of approximately 6% in few-shot accuracy and 48% in query efficiency compared to the best-performing alternative BBPT methods, establishing a new state of the art. Our ablation analysis further shows that the proposed clipping mechanism is robust and nearly optimal, without the need to manually select the clipping threshold, matching the result of expensive hyperparameter search.

arXiv information

Authors	Seonghwan Park, Jaehyeon Jeong, Yongjun Kim, Jaeho Lee, Namhoon Lee
Published	2025-04-09 12:56:22+00:00
arXiv page	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google
