Probabilistic Language-Image Pre-Training

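Summary

Vision-language models (VLMs) embed aligned image-text pairs into a joint space, but they often rely on deterministic embeddings that assume a one-to-one correspondence between images and texts.
This oversimplifies real-world relationships, which are inherently many-to-many: multiple captions can describe a single image, and vice versa.
The authors introduce Probabilistic Language-Image Pre-training (ProLIP), the first probabilistic VLM pre-trained on a billion-scale image-text dataset using only probabilistic objectives, and it achieves strong zero-shot capability (e.g., 74.6% ImageNet zero-shot accuracy with ViT-B/16).
ProLIP estimates uncertainty efficiently through an "uncertainty token", without extra parameters.
The authors also introduce a novel inclusion loss that enforces distributional inclusion relationships between image-text pairs and between original and masked inputs.
Experiments demonstrate that, by leveraging uncertainty estimates, ProLIP benefits downstream tasks and aligns with intuitive notions of uncertainty: for example, shorter texts are more uncertain, and more general inputs include specific ones.
Using text uncertainties, ImageNet accuracy improves further from 74.6% to 75.8% (under a few-shot setting), supporting the practical advantages of the probabilistic approach.
The code is available at https://github.com/naver-ai/prolip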

Abstract (original)

Vision-language models (VLMs) embed aligned image-text pairs into a joint space but often rely on deterministic embeddings, assuming a one-to-one correspondence between images and texts. This oversimplifies real-world relationships, which are inherently many-to-many, with multiple captions describing a single image and vice versa. We introduce Probabilistic Language-Image Pre-training (ProLIP), the first probabilistic VLM pre-trained on a billion-scale image-text dataset using only probabilistic objectives, achieving a strong zero-shot capability (e.g., 74.6% ImageNet zero-shot accuracy with ViT-B/16). ProLIP efficiently estimates uncertainty by an ‘uncertainty token’ without extra parameters. We also introduce a novel inclusion loss that enforces distributional inclusion relationships between image-text pairs and between original and masked inputs. Experiments demonstrate that, by leveraging uncertainty estimates, ProLIP benefits downstream tasks and aligns with intuitive notions of uncertainty, e.g., shorter texts being more uncertain and more general inputs including specific ones. Utilizing text uncertainties, we further improve ImageNet accuracy from 74.6% to 75.8% (under a few-shot setting), supporting the practical advantages of our probabilistic approach. The code is available at https://github.com/naver-ai/prolip
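
One way to picture what a probabilistic embedding adds over a deterministic one is the minimal sketch below. It assumes each input is embedded as a diagonal Gaussian (a mean vector plus per-dimension variances). The abstract does not state the distribution family or the similarity measure, so this is an illustrative assumption and not ProLIP's actual formulation. The class `GaussianEmbedding`, the function `expected_sq_distance`, and all numbers are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class GaussianEmbedding:
    """Hypothetical probabilistic embedding: a diagonal Gaussian."""
    mean: List[float]
    var: List[float]  # per-dimension variance, assumed non-negative

    def uncertainty(self) -> float:
        # One simple scalar summary of uncertainty: the total variance.
        return sum(self.var)


def expected_sq_distance(a: GaussianEmbedding, b: GaussianEmbedding) -> float:
    """E||Za - Zb||^2 for independent diagonal Gaussians Za and Zb.

    This equals ||mu_a - mu_b||^2 + sum(var_a) + sum(var_b).
    """
    mean_term = sum((x - y) ** 2 for x, y in zip(a.mean, b.mean))
    return mean_term + sum(a.var) + sum(b.var)


# Made-up toy values, not outputs of any trained model.
image = GaussianEmbedding(mean=[1.0, 0.0], var=[0.01, 0.01])
specific_text = GaussianEmbedding(mean=[0.9, 0.1], var=[0.02, 0.02])
generic_text = GaussianEmbedding(mean=[0.5, 0.5], var=[0.30, 0.30])

d_specific = expected_sq_distance(image, specific_text)
d_generic = expected_sq_distance(image, generic_text)
print(f"specific caption: distance={d_specific:.2f}, "
      f"uncertainty={specific_text.uncertainty():.2f}")
print(f"generic caption:  distance={d_generic:.2f}, "
      f"uncertainty={generic_text.uncertainty():.2f}")
```

In this toy setup the vaguer caption carries the larger total variance, which is the kind of behaviour the abstract describes for short or general texts. A deterministic embedding would keep only the mean vectors and could not express that difference.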

arXiv information

Authors	Sanghyuk Chun, Wonjae Kim, Song Park, Sangdoo Yun
Published	2024-12-06 15:20:28+00:00

Source and services used

arxiv.jp, Google
