Descriminative-Generative Custom Tokens for Vision-Language Models

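Abstract

This paper explores the possibility of learning custom tokens that represent new concepts in Vision-Language Models (VLMs).
The aim is to learn tokens that are effective for both discriminative and generative tasks while composing well with words to form new input queries.
The target concept is specified by a small set of images and a parent concept described in text.
The authors operate on CLIP text features and propose combining a textual inversion loss with a classification loss, so that the text features of the learned token align with the image features of the concept in the CLIP embedding space.
The learned token is restricted to a low-dimensional subspace spanned by tokens for attributes that are appropriate for the given super-class.
These modifications improve the quality of compositions of the learned token with natural language when generating new scenes.
The authors further show that learned custom tokens can be used to form queries for text-to-image retrieval, with the important benefit that composite queries can be visualized to check that the desired concept is faithfully encoded.
Building on this, they introduce Generation Aided Image Retrieval, in which the query is modified at inference time to better suit the search intent.
On the DeepFashion2 dataset, the method improves Mean Reciprocal Retrieval (MRR) by 7% over relevant baselines.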
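
The abstract reports its retrieval gain in MRR but does not define the metric. The sketch below assumes it follows the standard mean reciprocal rank: for each query, take the reciprocal of the rank of the first relevant result, then average over queries. The function name and the toy data are hypothetical and do not come from the paper.

```python
def mean_reciprocal_rank(rankings, relevant_sets):
    """Average of 1/rank of the first relevant item per query.

    rankings: list of ranked result lists, one per query (best first).
    relevant_sets: list of sets of relevant item ids, one per query.
    A query with no relevant item in its ranking contributes 0.
    """
    if not rankings:
        return 0.0
    total = 0.0
    for ranking, relevant in zip(rankings, relevant_sets):
        for position, item in enumerate(ranking, start=1):
            if item in relevant:
                total += 1.0 / position
                break
    return total / len(rankings)


# Toy example (hypothetical data): first hit at rank 1, rank 2, and no hit.
rankings = [["a", "b", "c"], ["x", "y", "z"], ["p", "q", "r"]]
relevant_sets = [{"a"}, {"y"}, {"s"}]
score = mean_reciprocal_rank(rankings, relevant_sets)  # (1/1 + 1/2 + 0) / 3 = 0.5
```

Under this definition, a 7% improvement means that the first relevant image tends to appear closer to the top of the ranked list.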

Abstract (original)

This paper explores the possibility of learning custom tokens for representing new concepts in Vision-Language Models (VLMs). Our aim is to learn tokens that can be effective for both discriminative and generative tasks while composing well with words to form new input queries. The targeted concept is specified in terms of a small set of images and a parent concept described using text. We operate on CLIP text features and propose to use a combination of a textual inversion loss and a classification loss to ensure that text features of the learned token are aligned with image features of the concept in the CLIP embedding space. We restrict the learned token to a low-dimensional subspace spanned by tokens for attributes that are appropriate for the given super-class. These modifications improve the quality of compositions of the learned token with natural language for generating new scenes. Further, we show that learned custom tokens can be used to form queries for text-to-image retrieval task, and also have the important benefit that composite queries can be visualized to ensure that the desired concept is faithfully encoded. Based on this, we introduce the method of Generation Aided Image Retrieval, where the query is modified at inference time to better suit the search intent. On the DeepFashion2 dataset, our method improves Mean Reciprocal Retrieval (MRR) over relevant baselines by 7%.
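
The abstract says the learned token is restricted to a low-dimensional subspace spanned by attribute tokens, but it does not say how that restriction is parameterized. One natural reading, assumed here, is that the token is a linear combination of the attribute token embeddings and that only the combination coefficients are learned. The sketch below shows only that parameterization. The function name, the vectors, and the coefficients are hypothetical, and the CLIP text encoder and the two losses are not reproduced.

```python
def combine_attribute_tokens(coefficients, attribute_tokens):
    """Return sum_i coefficients[i] * attribute_tokens[i].

    attribute_tokens: k embedding vectors of dimension d (lists of floats).
    coefficients: k scalars; under the assumption above these would be the
    only learnable parameters, so the token always stays in the span.
    """
    dim = len(attribute_tokens[0])
    token = [0.0] * dim
    for coeff, vector in zip(coefficients, attribute_tokens):
        for j in range(dim):
            token[j] += coeff * vector[j]
    return token


# Hypothetical 4-dimensional embeddings for two attribute words.
attribute_tokens = [
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
]
coefficients = [0.3, -1.2]  # k = 2 learnable numbers instead of d = 4
custom_token = combine_attribute_tokens(coefficients, attribute_tokens)
```

With this parameterization the optimizer searches over k coefficients instead of all d embedding dimensions, and every value it can reach is a mixture of attributes that make sense for the super-class.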

arXiv information

Authors	Pramuditha Perera, Matthew Trager, Luca Zancato, Alessandro Achille, Stefano Soatto
Published	2025-02-17 18:13:42+00:00
arXiv page	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google
