Towards Versatile and Efficient Visual Knowledge Integration into Pre-trained Language Models with Cross-Modal Adapters

Abstract

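Humans learn language through multi-modal knowledge.
However, because they are pre-trained on text alone, most existing pre-trained language models (PLMs) are hindered from using multi-modal information.
To inject visual knowledge into PLMs, existing methods use either the text encoder or the image encoder of a vision-language model (VLM) to encode the visual information, and they update all of the PLM's original parameters for knowledge fusion.
This paper proposes X-adapter, a new plug-and-play module that flexibly leverages the aligned visual and textual knowledge learned by pre-trained VLMs and injects it into PLMs efficiently.
Specifically, X-adapters are inserted into the PLM, and only the added parameters are updated during adaptation.
To fully exploit the potential of VLMs, each X-adapter consists of two sub-modules, V-expert and T-expert, which fuse the VLM's image and text representations, respectively.
Different sub-modules can be activated depending on the downstream task.
Experimental results show that the method significantly improves performance on object-color reasoning and natural language understanding (NLU) tasks compared with PLM baselines.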
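
The two-expert design described above can be illustrated with a small pure-Python sketch. The abstract does not specify how the experts fuse VLM features, so cross-attention with a residual connection is an assumption made here for illustration; the class names `Expert` and `XAdapter`, the dimensions, and the random initialization are all hypothetical and are not taken from the authors' implementation.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def rand_matrix(rows, cols, rng):
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


class Expert:
    """Hypothetical fusion sub-module (assumed cross-attention):
    PLM hidden states act as queries, VLM features as keys and values."""

    def __init__(self, plm_dim, vlm_dim, rng):
        self.w_q = rand_matrix(plm_dim, plm_dim, rng)
        self.w_k = rand_matrix(vlm_dim, plm_dim, rng)
        self.w_v = rand_matrix(vlm_dim, plm_dim, rng)

    def __call__(self, hidden, vlm_feats):
        q = matmul(hidden, self.w_q)      # (tokens, plm_dim)
        k = matmul(vlm_feats, self.w_k)   # (features, plm_dim)
        v = matmul(vlm_feats, self.w_v)   # (features, plm_dim)
        scale = math.sqrt(len(q[0]))
        # Scaled dot-product attention of every token over every VLM feature.
        attn = [
            softmax([sum(a * b for a, b in zip(qr, kr)) / scale for kr in k])
            for qr in q
        ]
        return matmul(attn, v)            # (tokens, plm_dim)


class XAdapter:
    """Illustrative adapter holding a V-expert and a T-expert.
    Only these weights would be trained; the PLM itself stays frozen."""

    def __init__(self, plm_dim, vlm_dim, seed=0):
        rng = random.Random(seed)
        self.v_expert = Expert(plm_dim, vlm_dim, rng)
        self.t_expert = Expert(plm_dim, vlm_dim, rng)

    def __call__(self, hidden, image_feats=None, text_feats=None):
        out = [row[:] for row in hidden]  # residual path keeps the PLM states
        for expert, feats in ((self.v_expert, image_feats),
                              (self.t_expert, text_feats)):
            if feats is None:
                continue                  # this expert is not activated
            fused = expert(hidden, feats)
            out = [[o + f for o, f in zip(o_row, f_row)]
                   for o_row, f_row in zip(out, fused)]
        return out


# Two token states of width 4 and three VLM image features of width 6.
hidden = [[0.1, 0.2, 0.3, 0.4], [0.5, 0.6, 0.7, 0.8]]
image_feats = [[1.0] * 6, [0.5] * 6, [0.2] * 6]
adapter = XAdapter(plm_dim=4, vlm_dim=6)
fused = adapter(hidden, image_feats=image_feats)  # only the V-expert is active
```

Passing only `image_feats` or only `text_feats` mirrors the idea of activating a different sub-module per downstream task. In a real training setup, one would presumably hand only the adapter's weights to the optimizer, which is what keeps the adaptation parameter-efficient.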

Abstract (original)

Humans learn language via multi-modal knowledge. However, due to the text-only pre-training scheme, most existing pre-trained language models (PLMs) are hindered from the multi-modal information. To inject visual knowledge into PLMs, existing methods incorporate either the text or image encoder of vision-language models (VLMs) to encode the visual information and update all the original parameters of PLMs for knowledge fusion. In this paper, we propose a new plug-and-play module, X-adapter, to flexibly leverage the aligned visual and textual knowledge learned in pre-trained VLMs and efficiently inject them into PLMs. Specifically, we insert X-adapters into PLMs, and only the added parameters are updated during adaptation. To fully exploit the potential in VLMs, X-adapters consist of two sub-modules, V-expert and T-expert, to fuse VLMs’ image and text representations, respectively. We can opt for activating different sub-modules depending on the downstream tasks. Experimental results show that our method can significantly improve the performance on object-color reasoning and natural language understanding (NLU) tasks compared with PLM baselines.

arXiv information

Authors	Xinyun Zhang, Haochen Tan, Han Wu, Mingjie Zhan, Ding Liang, Bei Yu
Published	2023-08-28 11:07:56+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
