Learning Concise and Descriptive Attributes for Visual Recognition

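Abstract

Recent advances in foundation models open up new opportunities for interpretable visual recognition: one can first query a large language model (LLM) to obtain a set of attributes describing each class, and then apply a vision-language model to classify images through those attributes.
Pioneering work has shown that querying thousands of attributes can achieve performance competitive with image features.
However, our further investigation on 8 datasets reveals that large numbers of LLM-generated attributes perform almost the same as random words.
This surprising finding suggests that these attributes may contain significant noise.
We hypothesize that there exist much smaller subsets of attributes that maintain classification performance, and we propose a novel learning-to-search method to discover such concise attribute sets.
As a result, on the CUB dataset, our method achieves performance close to that of massive LLM-generated attribute sets (e.g., 10,000 attributes for CUB) while using only 32 attributes in total to distinguish 200 bird species.
Furthermore, our new paradigm shows several additional benefits, such as higher interpretability and interactivity for humans and the ability to summarize knowledge for a recognition task.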

Abstract (original)

Recent advances in foundation models present new opportunities for interpretable visual recognition — one can first query Large Language Models (LLMs) to obtain a set of attributes that describe each class, then apply vision-language models to classify images via these attributes. Pioneering work shows that querying thousands of attributes can achieve performance competitive with image features. However, our further investigation on 8 datasets reveals that LLM-generated attributes in a large quantity perform almost the same as random words. This surprising finding suggests that significant noise may be present in these attributes. We hypothesize that there exist subsets of attributes that can maintain the classification performance with much smaller sizes, and propose a novel learning-to-search method to discover those concise sets of attributes. As a result, on the CUB dataset, our method achieves performance close to that of massive LLM-generated attributes (e.g., 10k attributes for CUB), yet using only 32 attributes in total to distinguish 200 bird species. Furthermore, our new paradigm demonstrates several additional benefits: higher interpretability and interactivity for humans, and the ability to summarize knowledge for a recognition task.
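The attribute-based classification pipeline described in the abstract can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: the random arrays stand in for embeddings that a vision-language model's image and text encoders would produce, the function names (`attribute_scores`, `classify`) are hypothetical, and the linear layer over attribute scores is an assumed design. The learning-to-search step that selects the concise attribute set is not detailed in the abstract and is not shown here.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length so dot products become cosine similarities."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


def attribute_scores(image_emb, attr_emb):
    """Cosine similarity of every image to every attribute: (n_images, n_attrs)."""
    return l2_normalize(image_emb) @ l2_normalize(attr_emb).T


def classify(image_emb, attr_emb, class_weights):
    """Predict a class per image from its attribute scores.

    class_weights has shape (n_attrs, n_classes); in practice it would be
    learned, here it is only a placeholder.
    """
    logits = attribute_scores(image_emb, attr_emb) @ class_weights
    return logits.argmax(axis=1)


# Sizes loosely mirror the CUB setting in the abstract: 32 attributes, 200 classes.
rng = np.random.default_rng(0)
dim, n_attrs, n_classes, n_images = 64, 32, 200, 5

image_emb = rng.normal(size=(n_images, dim))           # stand-in for image-encoder output
attr_emb = rng.normal(size=(n_attrs, dim))             # stand-in for encoded attribute phrases
class_weights = rng.normal(size=(n_attrs, n_classes))  # stand-in for learned weights

scores = attribute_scores(image_emb, attr_emb)
preds = classify(image_emb, attr_emb, class_weights)
print(scores.shape, preds.shape)
```

Because each prediction is a weighted sum over a small number of named attributes, the per-attribute scores in `scores` can be inspected directly, which is the kind of interpretability the abstract refers to.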

arXiv information

Authors	An Yan, Yu Wang, Yiwu Zhong, Chengyu Dong, Zexue He, Yujie Lu, William Wang, Jingbo Shang, Julian McAuley
Published	2023-08-07 16:00:22+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
