CLIP-Count: Towards Text-Guided Zero-Shot Object Counting

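Summary

Recent advances in vision-language models have demonstrated remarkable zero-shot text-image matching ability that transfers to downstream tasks such as object detection and segmentation.
Adapting these models to object counting, however, remains a difficult problem.
This study is the first to investigate transferring vision-language models (VLMs) to class-agnostic object counting.
Specifically, the authors propose CLIP-Count, the first end-to-end pipeline that estimates density maps for open-vocabulary objects with text guidance in a zero-shot manner.
To align the text embedding with dense visual features, they introduce a patch-text contrastive loss that guides the model to learn informative patch-level visual representations for dense prediction.
They also design a hierarchical patch-text interaction module that propagates semantic information across the different resolution levels of the visual features.
By fully exploiting the rich image-text alignment knowledge of pretrained VLMs, the method effectively generates high-quality density maps for the objects of interest.
Extensive experiments on the FSC-147, CARPK, and ShanghaiTech crowd counting datasets demonstrate the state-of-the-art accuracy and generalizability of the proposed method.
Code is available at https://github.com/songrise/CLIP-Count.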
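
As a small illustration of what a density map provides, the sketch below uses the usual convention in density-based counting, where the predicted count is the sum of all values in the map. The function name `count_from_density_map` and the toy values are hypothetical and are not taken from the CLIP-Count code.

```python
def count_from_density_map(density_map):
    """Return the predicted count as the sum of all density values.

    `density_map` is a 2D list of non-negative floats (rows of pixels).
    This follows the common density-map counting convention; it is an
    illustrative helper, not part of the CLIP-Count repository.
    """
    return sum(sum(row) for row in density_map)


# Hypothetical 3x4 density map; the values are made up for illustration.
toy_density = [
    [0.0, 0.5, 0.5, 0.0],      # one object spread over two pixels
    [0.0, 0.0, 0.0, 0.0],      # background
    [0.25, 0.25, 0.25, 0.25],  # one object spread over four pixels
]

predicted_count = count_from_density_map(toy_density)
print(predicted_count)
```

Each object contributes a total mass of about one to the map, so the sum approximates the number of objects even when individual instances overlap or are very small.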

Abstract (original)

Recent advances in visual-language models have shown remarkable zero-shot text-image matching ability that is transferable to downstream tasks such as object detection and segmentation. Adapting these models for object counting, however, remains a formidable challenge. In this study, we first investigate transferring vision-language models (VLMs) for class-agnostic object counting. Specifically, we propose CLIP-Count, the first end-to-end pipeline that estimates density maps for open-vocabulary objects with text guidance in a zero-shot manner. To align the text embedding with dense visual features, we introduce a patch-text contrastive loss that guides the model to learn informative patch-level visual representations for dense prediction. Moreover, we design a hierarchical patch-text interaction module to propagate semantic information across different resolution levels of visual features. Benefiting from the full exploitation of the rich image-text alignment knowledge of pretrained VLMs, our method effectively generates high-quality density maps for objects-of-interest. Extensive experiments on FSC-147, CARPK, and ShanghaiTech crowd counting datasets demonstrate state-of-the-art accuracy and generalizability of the proposed method. Code is available: https://github.com/songrise/CLIP-Count.
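
The abstract does not spell out the form of the patch-text contrastive loss, so the following is only a minimal sketch under stated assumptions: an InfoNCE-style objective in which patches that contain the target object are positives for the text embedding and all other patches are negatives. The function names, the boolean `positive_mask`, and the temperature value are hypothetical choices for illustration and may differ from the authors' formulation.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two non-zero vectors given as lists."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def patch_text_contrastive_loss(patch_embeddings, text_embedding,
                                positive_mask, temperature=0.07):
    """InfoNCE-style loss between one text embedding and many patches.

    ASSUMPTION: this is a generic contrastive objective, not the loss
    confirmed by the paper. `positive_mask[i]` is True when patch i is
    assumed to contain the object; at least one entry must be True.
    """
    logits = [cosine_similarity(p, text_embedding) / temperature
              for p in patch_embeddings]
    # Subtract the maximum logit for numerical stability.
    max_logit = max(logits)
    exps = [math.exp(l - max_logit) for l in logits]
    positive_mass = sum(e for e, is_pos in zip(exps, positive_mask) if is_pos)
    total_mass = sum(exps)
    # Low when similarity mass concentrates on the positive patches.
    return -math.log(positive_mass / total_mass)


# Toy 2D embeddings (hypothetical values).
text = [1.0, 0.0]
patches = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [-1.0, 0.0]]

aligned_loss = patch_text_contrastive_loss(
    patches, text, [True, True, False, False])
misaligned_loss = patch_text_contrastive_loss(
    patches, text, [False, False, True, True])
print(aligned_loss < misaligned_loss)
```

In this toy setting the loss is small when the patches marked as positive are the ones most similar to the text embedding and large when they are not, which is the behavior a loss of this kind would need in order to push patch-level features toward the text embedding of the object they depict.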

arXiv information

Authors	Ruixiang Jiang, Lingbo Liu, Changwen Chen
Published	2023-08-10 04:04:37+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
