A Statistical Theory of Contrastive Pre-training and Multimodal Generative AI

Summary

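Multimodal generative AI systems, such as those that combine vision and language, rely on contrastive pre-training to learn representations across modalities.
Although the practical benefits are widely recognized, a rigorous theoretical understanding of the contrastive pre-training framework is still limited.
This paper develops a theoretical framework that explains why contrastive pre-training succeeds on downstream tasks such as zero-shot classification, conditional diffusion models, and vision-language models.
It introduces approximate sufficient statistics, a generalization of classical sufficient statistics, and shows that near-minimizers of the contrastive pre-training loss are approximately sufficient, which makes them adaptable to a range of downstream tasks.
It also proposes the Joint Generative Hierarchical Model for the joint distribution of images and text, and shows that transformers can efficiently approximate the relevant functions in this model via belief propagation.
Building on this framework, the paper derives sample complexity guarantees for multimodal learning based on contrastively pre-trained representations.
Numerical simulations validate these theoretical findings and demonstrate the strong generalization performance of contrastively pre-trained transformers on various multimodal tasks.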

Summary (original)

Multi-modal generative AI systems, such as those combining vision and language, rely on contrastive pre-training to learn representations across different modalities. While their practical benefits are widely acknowledged, a rigorous theoretical understanding of the contrastive pre-training framework remains limited. This paper develops a theoretical framework to explain the success of contrastive pre-training in downstream tasks, such as zero-shot classification, conditional diffusion models, and vision-language models. We introduce the concept of approximate sufficient statistics, a generalization of the classical sufficient statistics, and show that near-minimizers of the contrastive pre-training loss are approximately sufficient, making them adaptable to diverse downstream tasks. We further propose the Joint Generative Hierarchical Model for the joint distribution of images and text, showing that transformers can efficiently approximate relevant functions within this model via belief propagation. Building on this framework, we derive sample complexity guarantees for multi-modal learning based on contrastive pre-trained representations. Numerical simulations validate these theoretical findings, demonstrating the strong generalization performance of contrastively pre-trained transformers in various multi-modal tasks.
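
To make the objects named in the abstract concrete, here is a minimal sketch of a CLIP-style symmetric contrastive (InfoNCE) loss and of zero-shot classification by nearest text embedding. It is an illustration under stated assumptions, not the paper's exact definitions: the function names, the `temperature` parameter, the inner-product similarity, and the toy two-dimensional embeddings are all hypothetical choices made for this example.

```python
import math


def log_softmax_row(row):
    """Numerically stable log-softmax of a list of floats."""
    m = max(row)
    log_z = m + math.log(sum(math.exp(v - m) for v in row))
    return [v - log_z for v in row]


def similarity_matrix(image_embs, text_embs, temperature=1.0):
    """Entry [i][j] is the inner product of image i and text j, divided by temperature."""
    return [
        [sum(a * b for a, b in zip(img, txt)) / temperature for txt in text_embs]
        for img in image_embs
    ]


def contrastive_loss(image_embs, text_embs, temperature=1.0):
    """Symmetric InfoNCE loss; pair (i, i) is assumed to be the matching pair."""
    sim = similarity_matrix(image_embs, text_embs, temperature)
    n = len(sim)
    sim_t = [list(col) for col in zip(*sim)]  # text-to-image direction
    img_to_txt = -sum(log_softmax_row(sim[i])[i] for i in range(n)) / n
    txt_to_img = -sum(log_softmax_row(sim_t[i])[i] for i in range(n)) / n
    return 0.5 * (img_to_txt + txt_to_img)


def zero_shot_classify(image_emb, class_text_embs):
    """Return the index of the class text embedding most similar to the image."""
    scores = [sum(a * b for a, b in zip(image_emb, txt)) for txt in class_text_embs]
    return max(range(len(scores)), key=scores.__getitem__)


# Toy data (hypothetical): two images and two captions in a 2-D embedding space.
aligned_images = [[1.0, 0.0], [0.0, 1.0]]
aligned_texts = [[1.0, 0.0], [0.0, 1.0]]
shuffled_texts = [[0.0, 1.0], [1.0, 0.0]]  # captions paired with the wrong image

loss_aligned = contrastive_loss(aligned_images, aligned_texts, temperature=0.1)
loss_shuffled = contrastive_loss(aligned_images, shuffled_texts, temperature=0.1)
predicted_class = zero_shot_classify([0.9, 0.1], aligned_texts)
```

In this toy setup the loss is small when matching pairs have the highest similarity and large when the captions are shuffled, which is the behavior a contrastive objective is meant to reward. The zero-shot step reuses the same embeddings without further training, in the spirit of the downstream adaptation the abstract describes.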

arXiv information

Authors	Kazusato Oko, Licong Lin, Yuhang Cai, Song Mei
Published	2025-01-08 17:47:06+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
