MAPL: Parameter-Efficient Adaptation of Unimodal Pre-Trained Models for Vision-Language Few-Shot Prompting

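Summary

Large pre-trained models have proven to be remarkable zero-shot and (prompt-based) few-shot learners on unimodal vision and language tasks.
MAPL is a simple, parameter-efficient method that reuses frozen pre-trained unimodal models and leverages their strong generalization capabilities in multimodal vision-language (VL) settings.
MAPL learns a lightweight mapping between the representation spaces of the unimodal models from aligned image-text data, and can generalize to unseen VL tasks from just a few in-context examples.
Because it has few trainable parameters, MAPL is effective for low-data and in-domain learning.
In addition, MAPL's modularity makes it easy to extend to other pre-trained models.
Extensive experiments on several visual question answering and image captioning benchmarks show that MAPL achieves superior or competitive performance compared with similar methods while training orders of magnitude fewer parameters.
MAPL can be trained in just a few hours using modest computational resources and public datasets.
The code and pre-trained model weights are released at https://github.com/mair-lab/mapl.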

Abstract (original)

Large pre-trained models have proved to be remarkable zero- and (prompt-based) few-shot learners in unimodal vision and language tasks. We propose MAPL, a simple and parameter-efficient method that reuses frozen pre-trained unimodal models and leverages their strong generalization capabilities in multimodal vision-language (VL) settings. MAPL learns a lightweight mapping between the representation spaces of unimodal models using aligned image-text data, and can generalize to unseen VL tasks from just a few in-context examples. The small number of trainable parameters makes MAPL effective at low-data and in-domain learning. Moreover, MAPL’s modularity enables easy extension to other pre-trained models. Extensive experiments on several visual question answering and image captioning benchmarks show that MAPL achieves superior or competitive performance compared to similar methods while training orders of magnitude fewer parameters. MAPL can be trained in just a few hours using modest computational resources and public datasets. We release our code and pre-trained model weights at https://github.com/mair-lab/mapl.
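As a rough illustration of the idea described in the abstract, the sketch below projects image features from a frozen vision encoder into a language model's embedding space and interleaves them with text to form a few-shot prompt. It is not the authors' implementation (the released repository is the authoritative reference). The function names, the single linear projection, the one-vector-per-image simplification, and the "Question/Answer" prompt wording are all assumptions made for illustration.

```python
import random


def make_linear_mapper(d_vision, d_lm, seed=0):
    """Create a small random weight matrix with d_lm rows and d_vision columns.

    Hypothetical stand-in for the only trainable component; the vision
    encoder and the language model are assumed to stay frozen.
    """
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 0.02) for _ in range(d_vision)] for _ in range(d_lm)]


def map_features(weights, feature):
    """Project one vision feature vector into the language model's space."""
    return [sum(w * x for w, x in zip(row, feature)) for row in weights]


def count_parameters(weights):
    """Number of trainable scalars in the mapping."""
    return sum(len(row) for row in weights)


def build_few_shot_prompt(shots, query_feature, query_text, weights):
    """Interleave mapped image vectors and text segments.

    shots: list of (image_feature, text) pairs used as in-context examples.
    Returns a list of ("image", vector) and ("text", string) segments that a
    frozen language model would consume after embedding the text itself.
    """
    prompt = []
    for image_feature, text in shots:
        prompt.append(("image", map_features(weights, image_feature)))
        prompt.append(("text", text))
    prompt.append(("image", map_features(weights, query_feature)))
    prompt.append(("text", query_text))
    return prompt


if __name__ == "__main__":
    # Toy dimensions; real encoders use far larger feature sizes.
    mapper = make_linear_mapper(d_vision=4, d_lm=3)
    shots = [
        ([0.1, 0.2, 0.3, 0.4], "Question: What animal is this? Answer: a dog"),
        ([0.5, 0.1, 0.0, 0.2], "Question: What animal is this? Answer: a cat"),
    ]
    prompt = build_few_shot_prompt(
        shots, [0.3, 0.3, 0.1, 0.0], "Question: What animal is this? Answer:", mapper
    )
    print(count_parameters(mapper), "trainable parameters")
    print([kind for kind, _ in prompt])
```

In this sketch only the mapping weights would be updated during training, which is what keeps the trainable parameter count small relative to the frozen models.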

arXiv information

Authors	Oscar Mañas, Pau Rodriguez, Saba Ahmadi, Aida Nematzadeh, Yash Goyal, Aishwarya Agrawal
Published	2023-03-15 00:02:06+00:00
arXiv page	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google
