Efficient Vocabulary-Free Fine-Grained Visual Recognition in the Age of Multimodal LLMs

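Abstract

Fine-grained visual recognition (FGVR) involves distinguishing between visually similar categories, and is inherently difficult because of subtle inter-class differences and the need for large, expert-annotated datasets. In domains such as medical imaging, such curated datasets are unavailable owing to privacy concerns and high annotation costs. In these scenarios without labeled data, an FGVR model cannot rely on a predefined set of training labels and therefore has an unconstrained output space for its predictions. We call this task Vocabulary-Free FGVR (VF-FGVR): the model must predict labels from an unconstrained output space without any prior label information. Recent Multimodal Large Language Models (MLLMs) show promise for VF-FGVR, but querying them for every test input is impractical because of high cost and prohibitive inference time. To address these limitations, we introduce Nearest-Neighbor Label Refinement (NeaR), a new approach that fine-tunes a downstream CLIP model using labels generated by an MLLM. The approach uses an MLLM for label generation and constructs a weakly supervised dataset from a small, unlabeled training set. NeaR is designed to handle the noise, stochasticity, and open-endedness inherent in MLLM-generated labels, and establishes a new benchmark for efficient VF-FGVR.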

Abstract (original)

Fine-grained Visual Recognition (FGVR) involves distinguishing between visually similar categories, which is inherently challenging due to subtle inter-class differences and the need for large, expert-annotated datasets. In domains like medical imaging, such curated datasets are unavailable due to issues like privacy concerns and high annotation costs. In such scenarios lacking labeled data, an FGVR model cannot rely on a predefined set of training labels, and hence has an unconstrained output space for predictions. We refer to this task as Vocabulary-Free FGVR (VF-FGVR), where a model must predict labels from an unconstrained output space without prior label information. While recent Multimodal Large Language Models (MLLMs) show potential for VF-FGVR, querying these models for each test input is impractical because of high costs and prohibitive inference times. To address these limitations, we introduce Nearest-Neighbor Label Refinement (NeaR), a novel approach that fine-tunes a downstream CLIP model using labels generated by an MLLM. Our approach constructs a weakly supervised dataset from a small, unlabeled training set, leveraging MLLMs for label generation. NeaR is designed to handle the noise, stochasticity, and open-endedness inherent in labels generated by MLLMs, and establishes a new benchmark for efficient VF-FGVR.
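
The abstract does not spell out how the refinement step works, so the following is only an illustrative sketch of the general idea of pooling noisy labels across nearest neighbours, not the authors' method. It assumes each training image already has an embedding (in the paper's setting these would plausibly come from CLIP; here they are toy 2-D vectors) and one noisy label string from an MLLM. The function name `refine_labels`, the parameter `k`, and the majority-vote rule are all hypothetical choices made for this example.

```python
import math
from collections import Counter


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def refine_labels(embeddings, noisy_labels, k=2):
    """For each sample, pool its own label with those of its k nearest
    neighbours and return (candidate label set, majority label).

    Assumption: majority voting stands in for whatever refinement
    the paper actually uses.
    """
    refined = []
    for i, e in enumerate(embeddings):
        # Rank all other samples by similarity to sample i, keep the top k.
        neighbours = sorted(
            (j for j in range(len(embeddings)) if j != i),
            key=lambda j: cosine(e, embeddings[j]),
            reverse=True,
        )[:k]
        votes = Counter([noisy_labels[i]] + [noisy_labels[j] for j in neighbours])
        refined.append((set(votes), votes.most_common(1)[0][0]))
    return refined


# Toy data: two visual clusters, each with one off-vocabulary label.
embeddings = [
    [1.0, 0.0], [0.9, 0.1], [0.95, 0.05],
    [0.0, 1.0], [0.1, 0.9], [0.05, 0.95],
]
noisy_labels = [
    "northern cardinal", "northern cardinal", "red bird",
    "blue jay", "blue jay", "jay",
]
refined = refine_labels(embeddings, noisy_labels, k=2)
for original, (candidates, majority) in zip(noisy_labels, refined):
    print(f"{original!r} -> {majority!r} (candidates: {sorted(candidates)})")
```

In this toy run the outlier labels "red bird" and "jay" are pulled toward the label shared by their neighbours, while the candidate set keeps the original guess available. A weakly supervised fine-tuning loss could then be defined over that set rather than over a single hard label.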

arXiv information

Authors	Hari Chandana Kuchibhotla, Sai Srinivas Kancheti, Abbavaram Gowtham Reddy, Vineeth N Balasubramanian
Published	2025-05-02 07:14:58+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, DeepL
