ASIF: Coupled Data Turns Unimodal Models to Multimodal Without Training

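Summary

CLIP showed that aligning visual and language spaces is key to solving many vision tasks without explicit training, but it required training the image and text encoders from scratch on a huge dataset.
LiT improved on this by training only the text encoder and using a pre-trained vision network.
This paper shows that a common space can be created without any training at all, using single-domain encoders (trained with or without supervision) and a much smaller number of image-text pairs.
The resulting model also has unique properties.
Most notably, a new version with updated training samples can be deployed in a matter of seconds.
In addition, representations in the common space are easy to interpret, because every dimension corresponds to the similarity of the input to a unique image-text pair in the multimodal dataset.
Experiments on standard zero-shot visual benchmarks demonstrate the transfer ability typical of image-text models.
Overall, the method is a simple yet surprisingly strong baseline for foundation multimodal models, and it raises important questions about their data efficiency and about the role of retrieval in machine learning.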
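
The idea that every dimension of the common space is a similarity to one image-text pair can be illustrated with a minimal sketch. This is not the authors' implementation: the use of cosine similarity, the function names (`relative_representation`, `zero_shot_classify`), and the toy embeddings are all assumptions made for illustration, and the abstract does not state what further processing the paper applies. The sketch only shows how two frozen unimodal encoders with different output sizes can be compared through a shared set of coupled pairs, and why adding a pair needs no training.

```python
import numpy as np

def normalize(x):
    """Scale each row to unit L2 norm so dot products are cosine similarities."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def relative_representation(queries, anchors):
    """Describe each query by its cosine similarity to every anchor.

    Entry (i, j) is the similarity of query i to anchor j, so each output
    dimension refers to one image-text pair of the coupled dataset.
    """
    return normalize(queries) @ normalize(anchors).T

def zero_shot_classify(image_emb, caption_embs, anchor_images, anchor_texts):
    """Pick the caption whose relative representation best matches the image's."""
    img_rel = relative_representation(image_emb[None, :], anchor_images)  # (1, n_pairs)
    txt_rel = relative_representation(caption_embs, anchor_texts)         # (n_captions, n_pairs)
    scores = (normalize(txt_rel) @ normalize(img_rel).T)[:, 0]            # (n_captions,)
    return int(np.argmax(scores)), scores

# Toy coupled dataset (made-up numbers): row j of each matrix embeds the same pair.
# The image encoder outputs 4 dimensions and the text encoder 3 (assumed sizes).
anchor_images = np.array([[1.0, 0.0, 0.0, 0.0],
                          [0.0, 1.0, 0.0, 0.0],
                          [0.0, 0.0, 1.0, 0.0]])
anchor_texts = np.array([[1.0, 0.0, 0.0],
                         [0.0, 1.0, 0.0],
                         [0.0, 0.0, 1.0]])

query_image = np.array([0.1, 0.9, 0.0, 0.2])   # closest to the second anchor image
captions = np.array([[0.9, 0.1, 0.0],
                     [0.1, 0.9, 0.1],           # closest to the second anchor text
                     [0.0, 0.2, 0.9]])

best, scores = zero_shot_classify(query_image, captions, anchor_images, anchor_texts)

# "Deploying a new version" in this sketch is only appending a pair: no weights change.
anchor_images_v2 = np.vstack([anchor_images, [[0.0, 0.0, 0.0, 1.0]]])
anchor_texts_v2 = np.vstack([anchor_texts, [[1.0, 1.0, 1.0]]])
img_rel_v2 = relative_representation(query_image[None, :], anchor_images_v2)
```

In this toy setup the query image and the second caption both resemble the second anchor pair, so the second caption gets the highest score. Appending a row to each anchor matrix adds one dimension to the common space, which is consistent with the claim that an updated model can be deployed in seconds.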

Abstract (original)

CLIP proved that aligning visual and language spaces is key to solving many vision tasks without explicit training, but required to train image and text encoders from scratch on a huge dataset. LiT improved this by only training the text encoder and using a pre-trained vision network. In this paper, we show that a common space can be created without any training at all, using single-domain encoders (trained with or without supervision) and a much smaller amount of image-text pairs. Furthermore, our model has unique properties. Most notably, deploying a new version with updated training samples can be done in a matter of seconds. Additionally, the representations in the common space are easily interpretable as every dimension corresponds to the similarity of the input to a unique image-text pair in the multimodal dataset. Experiments on standard zero-shot visual benchmarks demonstrate the typical transfer ability of image-text models. Overall, our method represents a simple yet surprisingly strong baseline for foundation multimodal models, raising important questions on their data efficiency and on the role of retrieval in machine learning.

arXiv information

Authors	Antonio Norelli, Marco Fumero, Valentino Maiorca, Luca Moschella, Emanuele Rodolà, Francesco Locatello
Published	2023-11-10 10:44:44+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
