LLMs as Visual Explainers: Advancing Image Classification with Evolving Visual Descriptions

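Abstract

Vision-language models (VLMs) offer a promising paradigm for image classification: they compare the similarity between an image embedding and class embeddings.
The key challenge is writing precise textual representations of the class names.
Earlier work has used recent advances in large language models (LLMs) to enrich these descriptors, but the outputs are often ambiguous and inaccurate.
The authors attribute this mainly to two factors: 1) reliance on single-turn textual interaction with the LLM, which leaves a mismatch between the generated text and the visual concepts of the VLM;
2) neglect of inter-class relationships, which yields descriptors that fail to distinguish similar classes effectively.
The paper proposes a new framework that integrates LLMs and VLMs to find optimal class descriptors.
The approach is training-free: it builds an LLM-based agent with an evolutionary optimization strategy that iteratively refines the class descriptors.
The authors demonstrate that the optimized descriptors are of high quality and effectively improve classification accuracy on a wide range of benchmarks.
In addition, the descriptors provide explainable and robust features, improve performance across various backbone models, and complement fine-tuning-based methods.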
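To make the similarity-based classification step concrete, here is a minimal sketch under stated assumptions. It is not the authors' implementation: the embeddings are hand-written toy vectors rather than VLM outputs, the names `cosine` and `classify` are hypothetical, and averaging the per-descriptor similarities within a class is an assumed aggregation rule that the abstract does not specify.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def classify(image_emb, descriptor_embs):
    """Score each class by the mean similarity between the image embedding
    and that class's descriptor embeddings (assumed aggregation rule)."""
    scores = {
        name: sum(cosine(image_emb, d) for d in embs) / len(embs)
        for name, embs in descriptor_embs.items()
    }
    return max(scores, key=scores.get), scores


# Toy stand-ins for text embeddings of class descriptors (hypothetical values).
descriptor_embs = {
    "cat": [[1.0, 0.0, 0.0], [0.8, 0.6, 0.0]],
    "dog": [[0.0, 1.0, 0.0], [0.0, 0.8, 0.6]],
}
# Toy stand-in for an image embedding (hypothetical values).
image_emb = [0.9, 0.1, 0.0]

predicted, scores = classify(image_emb, descriptor_embs)
print(predicted, scores)
```

In a real pipeline the vectors would come from the image and text encoders of a VLM; only the scoring logic is illustrated here.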

Abstract (original)

Vision-language models (VLMs) offer a promising paradigm for image classification by comparing the similarity between images and class embeddings. A critical challenge lies in crafting precise textual representations for class names. While previous studies have leveraged recent advancements in large language models (LLMs) to enhance these descriptors, their outputs often suffer from ambiguity and inaccuracy. We attribute this to two primary factors: 1) the reliance on single-turn textual interactions with LLMs, leading to a mismatch between generated text and visual concepts for VLMs; 2) the oversight of the inter-class relationships, resulting in descriptors that fail to differentiate similar classes effectively. In this paper, we propose a novel framework that integrates LLMs and VLMs to find the optimal class descriptors. Our training-free approach develops an LLM-based agent with an evolutionary optimization strategy to iteratively refine class descriptors. We demonstrate our optimized descriptors are of high quality which effectively improves classification accuracy on a wide range of benchmarks. Additionally, these descriptors offer explainable and robust features, boosting performance across various backbone models and complementing fine-tuning-based methods.
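The iterative refinement idea can be illustrated with a generic elitist evolutionary loop. This is a sketch under stated assumptions, not the paper's algorithm: `evolve`, `toy_mutate`, and `toy_fitness` are hypothetical names, the mutation step is a random phrase swap standing in for an LLM proposing revised descriptors, and the fitness function is a toy count standing in for feedback from a VLM such as classification accuracy.

```python
import random


def evolve(initial, mutate, fitness, generations=20, population=6, keep=2, seed=0):
    """Elitist evolutionary search: parents compete with their children,
    so the best fitness in the pool never decreases."""
    rng = random.Random(seed)
    pool = [initial]
    for _ in range(generations):
        children = [mutate(rng.choice(pool), rng) for _ in range(population)]
        pool = sorted(pool + children, key=fitness, reverse=True)[:keep]
    return pool[0]


# Hypothetical phrase pool; a real system would ask an LLM for candidates.
VOCAB = [
    "has whiskers", "is an animal", "has retractable claws",
    "is cute", "has vertical slit pupils", "lives with people",
]
# Toy notion of "visually discriminative" phrases (assumption for the demo).
USEFUL = {"has whiskers", "has retractable claws", "has vertical slit pupils"}


def toy_mutate(descriptors, rng):
    """Replace one descriptor at random (stand-in for an LLM edit)."""
    child = list(descriptors)
    child[rng.randrange(len(child))] = rng.choice(VOCAB)
    return tuple(child)


def toy_fitness(descriptors):
    """Count distinct useful phrases (stand-in for VLM-based scoring)."""
    return len(USEFUL & set(descriptors))


start = ("is an animal", "is cute", "lives with people")
best = evolve(start, toy_mutate, toy_fitness)
print(best, toy_fitness(best))
```

Keeping the parents in the candidate pool is a deliberate choice in this sketch: it guarantees that a bad batch of proposals cannot make the current best descriptor set worse.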

arXiv information

Authors	Songhao Han, Le Zhuo, Yue Liao, Si Liu
Published	2024-02-19 09:24:44+00:00
arXiv page	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google
