ECO: Ensembling Context Optimization for Vision-Language Models

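Summary

Image recognition has recently undergone a paradigm shift: vision-language models are now used to perform few-shot classification based on text prompts.
Among these models, CLIP has shown remarkable zero-shot transfer ability by matching images against custom text prompts in its latent space.
This has opened the way for several studies that focus on engineering or learning textual contexts to make the most of CLIP's classification ability.
This paper follows that trend by learning an ensemble of prompts for image classification.
It shows that learning diverse, and possibly shorter, contexts improves results considerably and consistently compared with relying on a single trainable prompt.
In particular, it reports better few-shot performance with no additional cost at inference time.
The approach is demonstrated on 11 different benchmarks.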

Summary (original)

Image recognition has recently witnessed a paradigm shift, where vision-language models are now used to perform few-shot classification based on textual prompts. Among these, the CLIP model has shown remarkable capabilities for zero-shot transfer by matching an image and a custom textual prompt in its latent space. This has paved the way for several works that focus on engineering or learning textual contexts for maximizing CLIP’s classification capabilities. In this paper, we follow this trend by learning an ensemble of prompts for image classification. We show that learning diverse and possibly shorter contexts improves considerably and consistently the results rather than relying on a single trainable prompt. In particular, we report better few-shot capabilities with no additional cost at inference time. We demonstrate the capabilities of our approach on 11 different benchmarks.
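
The abstract does not say how the learned prompts are combined, so the following is only a minimal sketch of one common way a prompt ensemble can add no inference cost: the text embeddings of all prompts are averaged once per class, and each image is then compared against a single embedding per class. This is an assumption for illustration, not the authors' implementation. Random vectors stand in for CLIP image and text features, and the function names (`ensemble_class_embeddings`, `classify`) are hypothetical.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length so dot products become cosine similarities."""
    return x / np.maximum(np.linalg.norm(x, axis=axis, keepdims=True), eps)


def ensemble_class_embeddings(prompt_text_features):
    """Collapse per-prompt text features into one embedding per class.

    prompt_text_features has shape (num_prompts, num_classes, dim).
    Averaging over prompts is an ASSUMED combination rule; the result
    has shape (num_classes, dim) regardless of how many prompts are used.
    """
    normalized = l2_normalize(prompt_text_features)
    return l2_normalize(normalized.mean(axis=0))


def classify(image_features, class_embeddings):
    """Assign each image to the class whose embedding is most similar."""
    similarities = l2_normalize(image_features) @ class_embeddings.T
    return similarities.argmax(axis=1)


# Stand-in data: each class has a clean direction, and each prompt yields
# a slightly perturbed version of it (hypothetical, not real CLIP output).
rng = np.random.default_rng(0)
num_prompts, num_classes, dim = 4, 3, 8
class_directions = np.eye(num_classes, dim)
noise = 0.05 * rng.normal(size=(num_prompts, num_classes, dim))
prompt_text_features = class_directions[None, :, :] + noise

# Computed once, before any image is seen.
class_embeddings = ensemble_class_embeddings(prompt_text_features)

# Images whose features point along classes 2, 0 and 1.
true_labels = np.array([2, 0, 1])
image_features = class_directions[true_labels]
predictions = classify(image_features, class_embeddings)
```

Under this assumption, the per-image work is one matrix product against a `(num_classes, dim)` array, the same as with a single prompt; the number of prompts only affects the one-time averaging step.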

arXiv information

Authors	Lorenzo Agnolucci, Alberto Baldrati, Francesco Todino, Federico Becattini, Marco Bertini, Alberto Del Bimbo
Published	2023-07-26 09:31:06+00:00
arXiv page	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
