Grounding Descriptions in Images informs Zero-Shot Visual Recognition

Abstract

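Vision-language models (VLMs) such as CLIP are valued for their ability to perform zero-shot visual recognition on open-vocabulary concepts.
They do so by selecting the object category whose text representation is most similar to the query image.
Although this approach succeeds in some domains, it struggles to identify fine-grained entities and to generalize to unseen concepts that the training distribution does not capture.
Recent work tries to mitigate these problems by integrating category descriptions at test time, but yields only modest improvements.
We attribute these limited gains to a fundamental misalignment between image and description representations, rooted in CLIP's pretraining structure.
This paper proposes GRAIN, a new pretraining strategy that aligns representations at the fine and coarse levels simultaneously.
The approach learns to ground textual descriptions in image regions while jointly aligning overarching captions with global image representations.
To drive this pretraining, we use frozen multimodal large language models (MLLMs) to derive large-scale synthetic annotations.
Across 11 diverse image classification datasets, we demonstrate improved zero-shot performance over current state-of-the-art methods.
We also introduce Products-2023, a newly curated, manually labeled dataset of novel concepts, and benchmark on it to show that the model can recognize these concepts.
Significant improvements on other downstream tasks such as retrieval further highlight the quality of the representations our approach learns.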
Code is available at https://github.com/shaunak27/grain-clip.

Abstract (original)

Vision-language models (VLMs) like CLIP have been cherished for their ability to perform zero-shot visual recognition on open-vocabulary concepts. This is achieved by selecting the object category whose textual representation bears the highest similarity with the query image. While successful in some domains, this method struggles with identifying fine-grained entities as well as generalizing to unseen concepts that are not captured by the training distribution. Recent works attempt to mitigate these challenges by integrating category descriptions at test time, albeit yielding modest improvements. We attribute these limited gains to a fundamental misalignment between image and description representations, which is rooted in the pretraining structure of CLIP. In this paper, we propose GRAIN, a new pretraining strategy aimed at aligning representations at both fine and coarse levels simultaneously. Our approach learns to jointly ground textual descriptions in image regions along with aligning overarching captions with global image representations. To drive this pre-training, we leverage frozen Multimodal Large Language Models (MLLMs) to derive large-scale synthetic annotations. We demonstrate the enhanced zero-shot performance of our model compared to current state-of-the-art methods across 11 diverse image classification datasets. Additionally, we introduce Products-2023, a newly curated, manually labeled dataset featuring novel concepts, and showcase our model’s ability to recognize these concepts by benchmarking on it. Significant improvements achieved by our model on other downstream tasks like retrieval further highlight the superior quality of representations learned by our approach. Code is available at https://github.com/shaunak27/grain-clip.
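
The zero-shot recognition step the abstract describes (pick the category whose text representation is most similar to the query image) can be sketched in a few lines. This is only an illustration, not the GRAIN code: the function names, the 3-dimensional toy embeddings, and the class names are all hypothetical, and in practice the vectors would come from the VLM's image and text encoders. The second function shows one common way to integrate category descriptions at test time, namely averaging the similarity over several descriptions per class. That averaging rule is an assumption; the abstract does not say how the cited works combine descriptions.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def zero_shot_classify(image_emb, class_text_embs):
    """Return the class whose text embedding is most similar to the image."""
    scores = {name: cosine(image_emb, emb) for name, emb in class_text_embs.items()}
    return max(scores, key=scores.get), scores

def description_classify(image_emb, class_desc_embs):
    """Assumed variant: score each class by its mean similarity
    over several description embeddings, then take the best class."""
    scores = {
        name: sum(cosine(image_emb, d) for d in descs) / len(descs)
        for name, descs in class_desc_embs.items()
    }
    return max(scores, key=scores.get), scores

# Toy embeddings, made up for illustration only.
image = [0.9, 0.1, 0.2]
class_prompts = {"sparrow": [1.0, 0.0, 0.1], "finch": [0.2, 1.0, 0.0]}
class_descriptions = {
    "sparrow": [[0.8, 0.1, 0.3], [1.0, 0.2, 0.0]],
    "finch": [[0.1, 0.9, 0.1], [0.3, 1.0, 0.2]],
}

label, prompt_scores = zero_shot_classify(image, class_prompts)
desc_label, desc_scores = description_classify(image, class_descriptions)
print(label, desc_label)
```

Both functions only compare embeddings that already exist, so their accuracy depends on how well the image and description representations are aligned, which is the misalignment the abstract says GRAIN's pretraining targets.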

arXiv information

Authors	Shaunak Halbe, Junjiao Tian, K J Joseph, James Seale Smith, Katherine Stevo, Vineeth N Balasubramanian, Zsolt Kira
Published	2024-12-05 18:52:00+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
