Connecting the Dots: Collaborative Fine-tuning for Black-Box Vision-Language Models

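Summary

With the emergence of pretrained vision-language models (VLMs), considerable effort has gone into fine-tuning them for downstream tasks.
Despite progress in designing efficient fine-tuning methods, these methods require access to the model's parameters, which is often not possible because model owners tend to release their models as black boxes to protect ownership.
This paper proposes Collaborative Fine-Tuning (CraFT), an approach for fine-tuning black-box VLMs to downstream tasks when only the model's input prompts and output predictions are accessible.
CraFT consists of two modules: a prompt generation module that learns text prompts, and a prediction refinement module that enhances the output predictions in residual style.
In addition, an auxiliary prediction-consistent loss is introduced to promote consistent optimization across the two modules.
The modules are optimized with a novel collaborative training algorithm.
Extensive experiments on few-shot classification over 15 datasets demonstrate the superiority of CraFT.
The results show that CraFT achieves a gain of about 12% with 16-shot datasets and only 8,000 queries.
Moreover, CraFT trains faster and uses only about 1/80 of the memory footprint for deployment, while sacrificing only 1.62% compared to the white-box method.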

Abstract (original)

With the emergence of pretrained vision-language models (VLMs), considerable efforts have been devoted to fine-tuning them for downstream tasks. Despite the progress made in designing efficient fine-tuning methods, such methods require access to the model’s parameters, which can be challenging as model owners often opt to provide their models as a black box to safeguard model ownership. This paper proposes a \textbf{C}ollabo\textbf{ra}tive \textbf{F}ine-\textbf{T}uning (\textbf{CraFT}) approach for fine-tuning black-box VLMs to downstream tasks, where one only has access to the input prompts and the output predictions of the model. CraFT comprises two modules, a prompt generation module for learning text prompts and a prediction refinement module for enhancing output predictions in residual style. Additionally, we introduce an auxiliary prediction-consistent loss to promote consistent optimization across these modules. These modules are optimized by a novel collaborative training algorithm. Extensive experiments on few-shot classification over 15 datasets demonstrate the superiority of CraFT. The results show that CraFT achieves a decent gain of about 12\% with 16-shot datasets and only 8,000 queries. Moreover, CraFT trains faster and uses only about 1/80 of the memory footprint for deployment, while sacrificing only 1.62\% compared to the white-box method.
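The abstract describes a prediction refinement module that works "in residual style" and an auxiliary prediction-consistent loss, but gives no formulas. The following is a minimal sketch of what those two ideas could look like, not the authors' implementation. Everything in it is an assumption made for illustration: the linear correction (`W`, `b`), the mixing weight `alpha`, the function names, and the choice of a symmetric KL divergence as the consistency term. The black-box VLM is stood in for by random logits.

```python
import numpy as np


def softmax(z):
    """Numerically stable softmax over the last axis."""
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


def refine(black_box_logits, W, b, alpha=0.5):
    """Residual-style refinement (hypothetical form).

    The black-box output is kept as-is and a learned correction is added
    on top, so W = 0, b = 0 leaves the predictions unchanged.
    """
    correction = black_box_logits @ W + b
    return black_box_logits + alpha * correction


def consistency_loss(logits_a, logits_b, eps=1e-12):
    """Symmetric KL divergence between two sets of predictions.

    Assumed stand-in for the paper's prediction-consistent loss; the
    abstract does not specify the actual form.
    """
    p = softmax(logits_a)
    q = softmax(logits_b)
    log_p = np.log(p + eps)
    log_q = np.log(q + eps)
    kl_pq = np.sum(p * (log_p - log_q), axis=-1)
    kl_qp = np.sum(q * (log_q - log_p), axis=-1)
    return float(np.mean(0.5 * (kl_pq + kl_qp)))


# Stand-in for black-box outputs: 4 samples, 3 classes.
rng = np.random.default_rng(0)
logits = rng.normal(size=(4, 3))

# With a zero correction the refined predictions equal the originals.
identity_refined = refine(logits, np.zeros((3, 3)), np.zeros(3))

# With a non-zero correction the two sets of predictions diverge.
shifted_refined = refine(logits, np.eye(3), np.zeros(3))

loss_same = consistency_loss(logits, identity_refined)
loss_diff = consistency_loss(logits, shifted_refined)
```

In this sketch the residual form means the refinement module starts from the black-box prediction rather than replacing it, and the consistency term is zero only when both branches agree.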

arXiv information

Authors	Zhengbo Wang, Jian Liang, Ran He, Zilei Wang, Tieniu Tan
Published	2024-02-06 14:53:19+00:00
arXiv page	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google
