Data-Efficient Multimodal Fusion on a Single GPU

Abstract

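The goal of multimodal alignment is to learn a single latent space shared across multimodal inputs.
The strongest models in this area are trained on massive datasets of paired inputs with large-scale compute, which makes them prohibitively expensive to train in many practical scenarios.
We surmise that existing unimodal encoders, pre-trained on large amounts of unimodal data, provide an effective bootstrap for building multimodal models from unimodal ones at much lower cost.
We therefore propose FuseMix, a multimodal augmentation scheme that operates on the latent spaces of arbitrary pre-trained unimodal encoders.
Using FuseMix for multimodal alignment, we achieve competitive performance in both image-text and audio-text retrieval, and in some cases outperform state-of-the-art methods, with orders of magnitude less compute and data.
For example, we outperform CLIP on the Flickr30K text-to-image retrieval task with $\sim \! 600\times$ fewer GPU days and $\sim \! 80\times$ fewer image-text pairs.
Additionally, we show how our method can be applied to convert pre-trained text-to-image generative models into audio-to-image ones.
Code is available at https://github.com/layer6ai-labs/fusemix.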

Abstract (original)

The goal of multimodal alignment is to learn a single latent space that is shared between multimodal inputs. The most powerful models in this space have been trained using massive datasets of paired inputs and large-scale computational resources, making them prohibitively expensive to train in many practical scenarios. We surmise that existing unimodal encoders pre-trained on large amounts of unimodal data should provide an effective bootstrap to create multimodal models from unimodal ones at much lower costs. We therefore propose FuseMix, a multimodal augmentation scheme that operates on the latent spaces of arbitrary pre-trained unimodal encoders. Using FuseMix for multimodal alignment, we achieve competitive performance (and in certain cases outperform state-of-the-art methods) in both image-text and audio-text retrieval, with orders of magnitude less compute and data: for example, we outperform CLIP on the Flickr30K text-to-image retrieval task with $\sim \! 600\times$ fewer GPU days and $\sim \! 80\times$ fewer image-text pairs. Additionally, we show how our method can be applied to convert pre-trained text-to-image generative models into audio-to-image ones. Code is available at: https://github.com/layer6ai-labs/fusemix.

arXiv information

Authors	Noël Vouitsis, Zhaoyan Liu, Satya Krishna Gorti, Valentin Villecroze, Jesse C. Cresswell, Guangwei Yu, Gabriel Loaiza-Ganem, Maksims Volkovs
Published	2024-01-02 15:16:36+00:00
arXiv page	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
