I2MVFormer: Large Language Model Generated Multi-View Document Supervision for Zero-Shot Image Classification

Abstract

Recent works have shown that unstructured text (documents) from online sources can serve as useful auxiliary information for zero-shot image classification. However, these methods require access to a high-quality source like Wikipedia and are limited to a single source of information. Large Language Models (LLMs) trained on web-scale text show impressive abilities to repurpose their learned knowledge for a multitude of tasks. In this work, we provide a novel perspective on using an LLM to provide text supervision for a zero-shot image classification model. The LLM is provided with a few text descriptions from different annotators as examples. The LLM is conditioned on these examples to generate multiple text descriptions for each class (referred to as views). Our proposed model, I2MVFormer, learns multi-view semantic embeddings for zero-shot image classification with these class views. We show that each text view of a class provides complementary information, allowing a model to learn a highly discriminative class embedding. Moreover, we show that I2MVFormer is better at consuming the multi-view text supervision from the LLM compared to baseline models. I2MVFormer establishes a new state-of-the-art on three public benchmark datasets for zero-shot image classification with unsupervised semantic embeddings.
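To make the pipeline described in the abstract more concrete, here is a minimal Python sketch of the two ideas it mentions: conditioning an LLM on a few annotator descriptions to request a new class description (a "view"), and classifying an image by comparing its embedding with a class embedding built from several views. This is not the authors' implementation. The function names (`build_view_prompt`, `mean_pool`, `classify`), the prompt layout, the toy three-dimensional embeddings, and the use of plain mean pooling with cosine similarity are all assumptions for illustration; I2MVFormer learns its multi-view aggregation rather than averaging.

```python
import math


def build_view_prompt(examples, target_class):
    """Build a few-shot prompt from (class_name, description) pairs.

    The layout is hypothetical; the abstract only says the LLM is
    conditioned on a few annotator-written descriptions.
    """
    lines = [f"Class: {name}\nDescription: {desc}\n" for name, desc in examples]
    lines.append(f"Class: {target_class}\nDescription:")
    return "\n".join(lines)


def mean_pool(views):
    """Average several view embeddings into one class embedding.

    Stand-in for a learned multi-view aggregation.
    """
    dim = len(views[0])
    return [sum(v[i] for v in views) / len(views) for i in range(dim)]


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def classify(image_embedding, class_views):
    """Return the class whose pooled view embedding best matches the image."""
    scores = {
        name: cosine(image_embedding, mean_pool(views))
        for name, views in class_views.items()
    }
    return max(scores, key=scores.get)


# Toy data: two classes, each with two made-up view embeddings.
class_views = {
    "zebra": [[0.9, 0.1, 0.0], [0.8, 0.2, 0.1]],
    "horse": [[0.1, 0.9, 0.0], [0.2, 0.8, 0.1]],
}
image_embedding = [0.7, 0.3, 0.05]

prompt = build_view_prompt(
    [("horse", "A large four-legged animal with a mane.")], "zebra"
)
predicted = classify(image_embedding, class_views)
print(prompt)
print(predicted)
```

In a real setting the view embeddings would come from a text encoder applied to the LLM-generated descriptions, and the image embedding from an image encoder; neither is specified here.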

arXiv information

Authors	Muhammad Ferjad Naeem, Muhammad Gul Zain Ali Khan, Yongqin Xian, Muhammad Zeshan Afzal, Didier Stricker, Luc Van Gool, Federico Tombari
Published	2022-12-05 14:11:36+00:00
arXiv page	arxiv_id(pdf)

Source, services used

arxiv.jp, DeepL
