Real Classification by Description: Extending CLIP’s Limits of Part Attributes Recognition

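Summary

This study defines and tackles zero-shot "real" classification by description, a new task that evaluates how well vision-language models (VLMs) such as CLIP can classify objects from descriptive attributes alone, with object class names excluded.
The approach exposes the current limitations of VLMs in understanding intricate object descriptions and pushes these models beyond mere object recognition.
To support this exploration, the authors introduce a new challenge and release description data for six popular fine-grained benchmarks; object names are omitted from the descriptions to encourage genuine zero-shot learning within the research community.
They also propose a method that strengthens CLIP's attribute detection through targeted training, pairing the diverse object categories of ImageNet21k with rich attribute descriptions generated by large language models.
In addition, they introduce a modified CLIP architecture that uses multiple resolutions to improve the detection of fine-grained part attributes.
Together, these efforts broaden the understanding of part-attribute recognition in CLIP and improve its performance on fine-grained classification across the six benchmarks, as well as on the PACO dataset, a widely used benchmark for object-attribute recognition.
Code is available at https://github.com/ethanbar11/grounding_ge_public.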

Summary (original)

In this study, we define and tackle zero shot ‘real’ classification by description, a novel task that evaluates the ability of Vision-Language Models (VLMs) like CLIP to classify objects based solely on descriptive attributes, excluding object class names. This approach highlights the current limitations of VLMs in understanding intricate object descriptions, pushing these models beyond mere object recognition. To facilitate this exploration, we introduce a new challenge and release description data for six popular fine-grained benchmarks, which omit object names to encourage genuine zero-shot learning within the research community. Additionally, we propose a method to enhance CLIP’s attribute detection capabilities through targeted training using ImageNet21k’s diverse object categories, paired with rich attribute descriptions generated by large language models. Furthermore, we introduce a modified CLIP architecture that leverages multiple resolutions to improve the detection of fine-grained part attributes. Through these efforts, we broaden the understanding of part-attribute recognition in CLIP, improving its performance in fine-grained classification tasks across six popular benchmarks, as well as in the PACO dataset, a widely used benchmark for object-attribute recognition. Code is available at: https://github.com/ethanbar11/grounding_ge_public.
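
To make the task concrete, the following is a minimal sketch of classification by description, not the authors' implementation. It assumes that image and description embeddings have already been produced by some CLIP-like encoder and that a class is scored by the mean cosine similarity between the image and that class's descriptions; the scoring rule, the function names, and the toy vectors are all assumptions made for illustration. The class keys are opaque identifiers used only for bookkeeping, since the descriptions themselves contain no class names.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def classify_by_description(image_emb, class_desc_embs):
    """Pick the class whose descriptions best match the image on average.

    image_emb: embedding of one image (list of floats).
    class_desc_embs: dict mapping an opaque class id to a list of
        description embeddings for that class.
    Assumption: mean cosine similarity is used as the class score.
    """
    scores = {}
    for class_id, desc_embs in class_desc_embs.items():
        sims = [cosine(image_emb, d) for d in desc_embs]
        scores[class_id] = sum(sims) / len(sims)
    predicted = max(scores, key=scores.get)
    return predicted, scores


# Toy 3-dimensional embeddings standing in for real encoder outputs (hypothetical).
toy_descriptions = {
    "class_a": [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0]],
    "class_b": [[0.0, 1.0, 0.0], [0.0, 0.9, 0.1]],
}
toy_image = [0.8, 0.2, 0.0]

predicted, scores = classify_by_description(toy_image, toy_descriptions)
print(predicted, scores)
```

In a real setting the toy vectors would be replaced by text embeddings of attribute descriptions and an image embedding from the same model; other aggregation rules (for example, a maximum or a weighted mean over descriptions) are equally plausible choices.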

arXiv information

Authors	Ethan Baron, Idan Tankel, Peter Tu, Guy Ben-Yosef
Published	2024-12-18 15:28:08+00:00
arXiv page	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
