UMG-CLIP: A Unified Multi-Granularity Vision Generalist for Open-World Understanding

Summary

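Vision-language foundation models, exemplified by Contrastive Language-Image Pre-training (CLIP), are attracting growing attention for their ability to jointly understand visual and textual tasks.
However, existing approaches mainly train models to match global image representations with text descriptions, and so overlook the important alignment between local regions and their corresponding text tokens.
This paper extends CLIP with multi-granularity alignment.
In particular, we deliberately construct a new dataset of pseudo annotations at several levels of granularity, including image-level, region-level, and pixel-level captions/tags.
Accordingly, we develop a unified multi-granularity learning framework named UMG-CLIP, which simultaneously gives the model versatile perception abilities across different levels of detail.
With parameter-efficient tuning, UMG-CLIP outperforms currently widely used CLIP models and achieves state-of-the-art performance on a range of image understanding benchmarks, including open-world recognition, retrieval, semantic segmentation, and panoptic segmentation.
We hope UMG-CLIP will serve as a valuable option for advancing vision-language foundation models.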
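
To make the contrast between global and local alignment concrete, here is a minimal sketch in plain Python. The abstract does not state the actual training objective, so this assumes the standard CLIP-style symmetric contrastive (InfoNCE) loss and simply applies it at two granularities: image with caption, and region with region text. The function names, the `region_weight` parameter, and the temperature of 0.07 are all hypothetical choices for illustration, and pixel-level supervision is left out.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def diag_nll(logits):
    """Mean negative log-softmax of the diagonal (matched-pair) entries, row by row."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the row maximum for numerical stability
        log_z = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_z - row[i]
    return total / len(logits)


def contrastive_loss(visual, text, temperature=0.07):
    """Symmetric InfoNCE loss; visual[i] and text[i] are assumed to be a matched pair."""
    v = [l2_normalize(x) for x in visual]
    t = [l2_normalize(x) for x in text]
    logits = [
        [sum(a * b for a, b in zip(vi, tj)) / temperature for tj in t]
        for vi in v
    ]
    logits_t = [list(col) for col in zip(*logits)]  # text-to-visual direction
    return 0.5 * (diag_nll(logits) + diag_nll(logits_t))


def multi_granularity_loss(image_emb, caption_emb, region_emb, region_text_emb,
                           region_weight=1.0):
    """Hypothetical combination: image-level term plus a weighted region-level term."""
    image_term = contrastive_loss(image_emb, caption_emb)
    region_term = contrastive_loss(region_emb, region_text_emb)
    return image_term + region_weight * region_term
```

With toy embeddings, correctly paired inputs should give a loss near zero, while shuffling the text side gives a large loss. A model trained only on the image-level term receives no signal about which region matches which phrase, which is the gap the summary describes.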

Summary (original)

Vision-language foundation models, represented by Contrastive language-image pre-training (CLIP), have gained increasing attention for jointly understanding both vision and textual tasks. However, existing approaches primarily focus on training models to match global image representations with textual descriptions, thereby overlooking the critical alignment between local regions and corresponding text tokens. This paper extends CLIP with multi-granularity alignment. Notably, we deliberately construct a new dataset comprising pseudo annotations at various levels of granularities, encompassing image-level, region-level, and pixel-level captions/tags. Accordingly, we develop a unified multi-granularity learning framework, named UMG-CLIP, that simultaneously empowers the model with versatile perception abilities across different levels of detail. Equipped with parameter efficient tuning, UMG-CLIP surpasses current widely used CLIP models and achieves state-of-the-art performance on diverse image understanding benchmarks, including open-world recognition, retrieval, semantic segmentation, and panoptic segmentation tasks. We hope UMG-CLIP can serve as a valuable option for advancing vision-language foundation models.

arXiv information

Authors	Bowen Shi, Peisen Zhao, Zichen Wang, Yuhang Zhang, Yaoming Wang, Jin Li, Wenrui Dai, Junni Zou, Hongkai Xiong, Qi Tian, Xiaopeng Zhang
Published	2024-01-18 16:40:22+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
