Visual Lexicon: Rich Image Features in Language Space

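Summary

We present Visual Lexicon (ViLex), a novel visual language that encodes rich image information into the text space of vocabulary tokens while retaining intricate visual details that are hard to convey in natural language.
Unlike traditional methods that prioritize either high-level semantics (e.g., CLIP) or pixel-level reconstruction (e.g., VAE), ViLex captures rich semantic content and fine visual details at the same time, enabling high-quality image generation and comprehensive visual scene understanding.
Through a self-supervised learning pipeline, ViLex generates tokens optimized for reconstructing the input image with a frozen text-to-image (T2I) diffusion model, preserving the detailed information needed for high-fidelity semantic-level reconstruction.
Because ViLex is an image embedding in the language space, its tokens leverage the compositionality of natural language: they can be used on their own as "text tokens" or combined with natural language tokens to prompt pretrained T2I models with both visual and textual inputs, mirroring how we interact with vision-language models (VLMs).
Experiments show that ViLex achieves higher fidelity in image reconstruction than text embeddings, even with a single ViLex token.
Moreover, ViLex successfully performs various DreamBooth tasks in a zero-shot, unsupervised manner, without fine-tuning the T2I model.
In addition, ViLex serves as a powerful vision encoder, consistently improving vision-language model performance across 15 benchmarks relative to a strong SigLIP baseline.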
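
The idea of combining ViLex tokens with natural language tokens can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `compose_prompt`, the toy 4-dimensional embeddings, and the choice to place text tokens before ViLex tokens are all assumptions made for illustration. The abstract only states that both kinds of tokens live in the same text embedding space and can be mixed in one prompt.

```python
from typing import List

# One token embedding in the (assumed shared) text embedding space.
Embedding = List[float]


def compose_prompt(
    text_tokens: List[Embedding],
    vilex_tokens: List[Embedding],
) -> List[Embedding]:
    """Concatenate text-token and image-derived token embeddings.

    Hypothetical helper: the ordering (text first) is an assumption,
    not something the abstract specifies.
    """
    widths = {len(tok) for tok in text_tokens + vilex_tokens}
    if len(widths) > 1:
        # Tokens can only share a prompt if they share an embedding width.
        raise ValueError("all tokens must have the same embedding width")
    return text_tokens + vilex_tokens


# Toy values standing in for real embeddings.
text_tokens = [[0.1, 0.2, 0.3, 0.4], [0.5, 0.6, 0.7, 0.8]]
vilex_tokens = [[0.9, 1.0, 1.1, 1.2]]

prompt = compose_prompt(text_tokens, vilex_tokens)
```

In a real pipeline the resulting sequence would be passed to the frozen T2I model in place of ordinary text embeddings. The sketch only shows why a shared embedding width is what makes the two token types interchangeable.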

Abstract (original)

We present Visual Lexicon, a novel visual language that encodes rich image information into the text space of vocabulary tokens while retaining intricate visual details that are often challenging to convey in natural language. Unlike traditional methods that prioritize either high-level semantics (e.g., CLIP) or pixel-level reconstruction (e.g., VAE), ViLex simultaneously captures rich semantic content and fine visual details, enabling high-quality image generation and comprehensive visual scene understanding. Through a self-supervised learning pipeline, ViLex generates tokens optimized for reconstructing input images using a frozen text-to-image (T2I) diffusion model, preserving the detailed information necessary for high-fidelity semantic-level reconstruction. As an image embedding in the language space, ViLex tokens leverage the compositionality of natural languages, allowing them to be used independently as "text tokens" or combined with natural language tokens to prompt pretrained T2I models with both visual and textual inputs, mirroring how we interact with vision-language models (VLMs). Experiments demonstrate that ViLex achieves higher fidelity in image reconstruction compared to text embeddings, even with a single ViLex token. Moreover, ViLex successfully performs various DreamBooth tasks in a zero-shot, unsupervised manner without fine-tuning T2I models. Additionally, ViLex serves as a powerful vision encoder, consistently improving vision-language model performance across 15 benchmarks relative to a strong SigLIP baseline.

arXiv information

Authors	XuDong Wang, Xingyi Zhou, Alireza Fathi, Trevor Darrell, Cordelia Schmid
Published	2024-12-09 18:57:24+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
