ASIF: Coupled Data Turns Unimodal Models to Multimodal Without Training

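Abstract

CLIP showed that aligning the visual and language spaces is key to solving many vision tasks without explicit training, but it requires training the image and text encoders from scratch on a huge dataset.
LiT improved on this by training only the text encoder and using a pre-trained vision network.
This paper shows that a common space can be created without any training at all, using single-domain encoders (trained with or without supervision) and a much smaller number of image-text pairs.
The resulting model also has unique properties.
Most notably, a new version with updated training samples can be deployed in a matter of seconds.
In addition, representations in the common space are easy to interpret, because every dimension corresponds to the similarity of the input to a unique entry in the multimodal dataset.
Experiments on standard zero-shot visual benchmarks demonstrate the transfer ability typical of image-text models.
Overall, the method is a simple yet surprisingly strong baseline for foundation multimodal models, and it raises important questions about their data efficiency and about the role of retrieval in machine learning.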
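
The interpretability claim above can be made concrete with a small sketch. This is a minimal illustration, not the authors' implementation: it assumes cosine similarity as the similarity measure and uses hand-written toy vectors in place of real encoder outputs. All names (`relative_representation`, `anchor_images`, `anchor_texts`, `candidate_texts`) are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def relative_representation(embedding, anchors):
    """One coordinate per anchor: similarity of the input to that anchor."""
    return [cosine(embedding, anchor) for anchor in anchors]


# Toy stand-ins for the outputs of two frozen unimodal encoders
# (assumed values). Pair i couples anchor_images[i] with anchor_texts[i].
# The two embedding spaces have different sizes and share no training.
anchor_images = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
anchor_texts = [[1.0, 0.0], [0.0, 1.0], [0.6, 0.8]]

# A new image and two candidate captions, embedded by the same encoders.
query_image = [0.9, 0.1, 0.0]
candidate_texts = {
    "caption A": [0.95, 0.05],
    "caption B": [0.05, 0.95],
}

# Both modalities are mapped into the same space: one dimension per pair.
image_rel = relative_representation(query_image, anchor_images)
scores = {
    caption: cosine(image_rel, relative_representation(vec, anchor_texts))
    for caption, vec in candidate_texts.items()
}

# Zero-shot choice: the caption whose relative representation is closest.
best = max(scores, key=scores.get)
print(best, scores)
```

In this sketch, updating the model amounts to appending a new pair to `anchor_images` and `anchor_texts`, which is consistent with the abstract's claim that a new version can be deployed quickly. No parameters are trained at any point.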

Abstract (original)

CLIP proved that aligning visual and language spaces is key to solving many vision tasks without explicit training, but required to train image and text encoders from scratch on a huge dataset. LiT improved this by only training the text encoder and using a pre-trained vision network. In this paper, we show that a common space can be created without any training at all, using single-domain encoders (trained with or without supervision) and a much smaller amount of image-text pairs. Furthermore, our model has unique properties. Most notably, deploying a new version with updated training samples can be done in a matter of seconds. Additionally, the representations in the common space are easily interpretable as every dimension corresponds to the similarity of the input to a unique entry in the multimodal dataset. Experiments on standard zero-shot visual benchmarks demonstrate the typical transfer ability of image-text models. Overall, our method represents a simple yet surprisingly strong baseline for foundation multi-modal models, raising important questions on their data efficiency and on the role of retrieval in machine learning.

arXiv information

Authors	Antonio Norelli, Marco Fumero, Valentino Maiorca, Luca Moschella, Emanuele Rodolà, Francesco Locatello
Published	2023-02-10 18:38:19+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
