Compositional Kronecker Context Optimization for Vision-Language Models

Summary

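Context Optimization (CoOp) has emerged as a simple yet effective way to adapt CLIP-like vision-language models to downstream image recognition tasks.
However, learning a compact context that adapts to new tasks while retaining satisfactory base-to-new, domain, and cross-task generalization remains a challenge.
To address this, we propose Compositional Kronecker Context Optimization (CK-CoOp), a lightweight yet generalizable approach.
Technically, the context words of a prompt in CK-CoOp are learnable vectors formed by linearly combining base vectors drawn from a dictionary.
Each base vector consists of a non-learnable component, obtained by quantizing the weights of the token embedding layer, and a learnable component, constructed by applying the Kronecker product to several small learnable matrices.
Intuitively, this compositional structure reduces the risk of overfitting to the training data by retaining more pre-trained knowledge.
Meanwhile, the Kronecker product lifts the restriction that the dictionary is non-learnable, enhancing representation ability with minimal additional parameters.
Extensive experiments confirm that CK-CoOp not only achieves state-of-the-art performance under base-to-new, domain, and cross-task generalization evaluation, but also uses fewer learnable parameters and is efficient in both training and inference.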
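
As a rough illustration of the construction described above, the following NumPy sketch composes context vectors from a dictionary of base vectors. It is not the authors' implementation: the toy k-means quantizer, the additive combination of the frozen and Kronecker components, all shapes, and the names `quantize_embeddings`, `build_context`, `kron_factors`, and `coeffs` are assumptions made for this example. The summary only states that the token embedding weights are quantized and that Kronecker products of small learnable matrices form the learnable part.

```python
import numpy as np


def quantize_embeddings(emb, k, iters=10, seed=0):
    """Toy k-means (an assumed quantizer): compress a (V, d) table into k base vectors."""
    rng = np.random.default_rng(seed)
    # Fancy indexing returns a copy, so updating centers leaves emb untouched.
    centers = emb[rng.choice(len(emb), size=k, replace=False)]
    for _ in range(iters):
        # Squared distance from every embedding to every center: shape (V, k).
        dist = ((emb[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
        assign = dist.argmin(axis=1)
        for j in range(k):
            members = emb[assign == j]
            if len(members):
                centers[j] = members.mean(axis=0)
    return centers


def build_context(frozen_dict, kron_factors, coeffs):
    """Compose context word vectors from a dictionary of base vectors.

    frozen_dict:  (K, d) non-learnable bases (quantized token embeddings)
    kron_factors: list of (A, B) pairs with np.kron(A, B).shape == (K, d)
    coeffs:       (M, K) learnable weights, one row per context word
    """
    # Learnable component: sum of Kronecker products of small matrices.
    learnable = sum(np.kron(a, b) for a, b in kron_factors)
    # Assumption: the two components are combined by addition.
    bases = frozen_dict + learnable
    # Each context word is a linear combination of the K base vectors.
    return coeffs @ bases


rng = np.random.default_rng(0)
token_emb = rng.normal(size=(200, 16))        # stand-in for a real embedding table
frozen = quantize_embeddings(token_emb, k=8)  # (8, 16), kept fixed during training
factors = [(0.01 * rng.normal(size=(2, 4)),   # A: (2, 4)
            0.01 * rng.normal(size=(4, 4)))]  # B: (4, 4), so kron(A, B) is (8, 16)
coeffs = rng.normal(size=(4, 8))              # 4 context words over 8 bases
context = build_context(frozen, factors, coeffs)
n_kron = sum(a.size + b.size for a, b in factors)
print(context.shape, n_kron, frozen.size)
```

With these toy sizes the Kronecker factors hold 24 values, versus 128 for a fully learnable 8 x 16 dictionary, which is the kind of parameter saving the summary alludes to.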

Summary (original)

Context Optimization (CoOp) has emerged as a simple yet effective technique for adapting CLIP-like vision-language models to downstream image recognition tasks. Nevertheless, learning compact context with satisfactory base-to-new, domain and cross-task generalization ability while adapting to new tasks is still a challenge. To tackle such a challenge, we propose a lightweight yet generalizable approach termed Compositional Kronecker Context Optimization (CK-CoOp). Technically, the prompt’s context words in CK-CoOp are learnable vectors, which are crafted by linearly combining base vectors sourced from a dictionary. These base vectors consist of a non-learnable component obtained by quantizing the weights in the token embedding layer, and a learnable component constructed by applying Kronecker product on several learnable tiny matrices. Intuitively, the compositional structure mitigates the risk of overfitting on training data by remembering more pre-trained knowledge. Meantime, the Kronecker product breaks the non-learnable restrictions of the dictionary, thereby enhancing representation ability with minimal additional parameters. Extensive experiments confirm that CK-CoOp achieves state-of-the-art performance under base-to-new, domain and cross-task generalization evaluation, but also has the metrics of fewer learnable parameters and efficient training and inference speed.

arXiv information

Authors	Kun Ding, Xiaohui Li, Qiang Yu, Ying Wang, Haojian Zhang, Shiming Xiang
Published	2024-03-18 10:09:28+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
