Probabilistic Language-Image Pre-Training

Summary

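Vision-language models (VLMs) embed aligned image-text pairs into a joint space, but they often rely on deterministic embeddings that assume a one-to-one correspondence between images and texts.
This oversimplifies real-world relationships, which are inherently many-to-many: multiple captions can describe a single image, and vice versa.
We introduce Probabilistic Language-Image Pre-training (ProLIP), the first probabilistic VLM pre-trained on a billion-scale image-text dataset using only probabilistic objectives, which achieves strong zero-shot capability (e.g., 74.6% ImageNet zero-shot accuracy with ViT-B/16).
ProLIP efficiently estimates uncertainty with an "uncertainty token", without extra parameters.
We also introduce a novel inclusion loss that enforces distributional inclusion relationships between image-text pairs and between original and masked inputs.
Experiments show that, by leveraging uncertainty estimates, ProLIP benefits downstream tasks and aligns with intuitive notions of uncertainty, e.g., shorter texts are more uncertain and more general inputs include more specific ones.
Using text uncertainties, we further improve ImageNet accuracy from 74.6% to 75.8% (under a few-shot setting), supporting the practical advantages of the probabilistic approach.
The code is available at https://github.com/naver-ai/prolip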

Abstract (original)

Vision-language models (VLMs) embed aligned image-text pairs into a joint space but often rely on deterministic embeddings, assuming a one-to-one correspondence between images and texts. This oversimplifies real-world relationships, which are inherently many-to-many, with multiple captions describing a single image and vice versa. We introduce Probabilistic Language-Image Pre-training (ProLIP), the first probabilistic VLM pre-trained on a billion-scale image-text dataset using only probabilistic objectives, achieving a strong zero-shot capability (e.g., 74.6% ImageNet zero-shot accuracy with ViT-B/16). ProLIP efficiently estimates uncertainty by an ‘uncertainty token’ without extra parameters. We also introduce a novel inclusion loss that enforces distributional inclusion relationships between image-text pairs and between original and masked inputs. Experiments demonstrate that, by leveraging uncertainty estimates, ProLIP benefits downstream tasks and aligns with intuitive notions of uncertainty, e.g., shorter texts being more uncertain and more general inputs including specific ones. Utilizing text uncertainties, we further improve ImageNet accuracy from 74.6% to 75.8% (under a few-shot setting), supporting the practical advantages of our probabilistic approach. The code is available at https://github.com/naver-ai/prolip
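
The abstract does not say how the "uncertainty token" output is turned into a distribution, so the following is only a minimal sketch under stated assumptions: each embedding is modeled as a diagonal Gaussian, the mean is read from a summary token at position 0, and the log-variance is read from the extra uncertainty token at position 1. The names `GaussianEmbedding` and `split_outputs`, the token positions, and the scalar uncertainty score are all hypothetical and are not taken from the ProLIP code; the numbers are made up for illustration.

```python
import math
from dataclasses import dataclass
from typing import List


@dataclass
class GaussianEmbedding:
    """A diagonal Gaussian embedding: one mean and one log-variance per dimension."""
    mean: List[float]
    log_var: List[float]

    def uncertainty(self) -> float:
        # Scalar summary: sum of per-dimension variances (an illustrative choice).
        return sum(math.exp(v) for v in self.log_var)


def split_outputs(token_outputs: List[List[float]]) -> GaussianEmbedding:
    """Read the mean from position 0 and the log-variance from position 1.

    Assumption: position 0 holds a summary token and position 1 holds the
    extra uncertainty token; all remaining positions are ignored.
    """
    if len(token_outputs) < 2:
        raise ValueError("need at least a summary token and an uncertainty token")
    return GaussianEmbedding(mean=token_outputs[0], log_var=token_outputs[1])


# Made-up encoder outputs (3 positions, 2 dimensions), not real model values.
vague = split_outputs([[0.1, 0.2], [0.0, 0.0], [0.5, 0.5]])
specific = split_outputs([[0.1, 0.2], [-2.0, -2.0], [0.5, 0.5]])

# Both inputs share the same mean but differ in how spread out they are.
print(vague.uncertainty() > specific.uncertainty())  # prints True
```

In this toy setup, the two inputs would be indistinguishable to a deterministic model that keeps only the mean, while the variance term lets the more general input be marked as more uncertain.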

arXiv information

Authors	Sanghyuk Chun, Wonjae Kim, Song Park, Sangdoo Yun
Published	2025-03-12 14:03:31+00:00
arXiv page	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
