Toward Building General Foundation Models for Language, Vision, and Vision-Language Understanding Tasks

要約

基礎モデルまたは事前トレーニングされたモデルにより、さまざまな言語、視覚、および視覚言語理解タスクのパフォーマンスが大幅に向上しました。
ただし、既存の基盤モデルは、言語、ビジョン、またはビジョン言語という 1 種類のタスクでのみ最高のパフォーマンスを発揮できます。
すべての理解タスクに対して最高のパフォーマンスを発揮する基礎モデル (一般基礎モデルと呼ばれる) を構築できるかどうかは、まだ未解決の問題です。
本稿では、新しい一般基礎モデル X-FM (X-Foundation Model) を提案します。
X-FM には、言語エンコーダー、ビジョンエンコーダー、フュージョンエンコーダーが 1 つと、新しいトレーニング方法が搭載されています。
このトレーニング方法には、テキスト、画像、および画像とテキストのペアのデータから X-FM を学習するための 2 つの新しい手法が含まれています。
1 つは、言語エンコーダーを学習するときに視覚言語トレーニングからの勾配を停止することです。
もう 1 つは、ビジョン言語トレーニングを活用して、ビジョンエンコーダーの学習をガイドすることです。
ベンチマークデータセットに関する広範な実験により、X-FM は既存の一般的な基礎モデルを大幅に上回り、特に言語、視覚、または視覚と言語の理解に関して既存の基礎モデルよりも優れた、または同等のパフォーマンスを発揮できることが示されています。
コードと事前トレーニングされたモデルは https://github.com/zhangxinsong-nlp/XFM でリリースされます。

要約(オリジナル)

Foundation models or pre-trained models have substantially improved the performance of various language, vision, and vision-language understanding tasks. However, existing foundation models can only perform the best in one type of tasks, namely language, vision, or vision-language. It is still an open question whether it is possible to construct a foundation model performing the best for all the understanding tasks, which we call a general foundation model. In this paper, we propose a new general foundation model, X-FM (the X-Foundation Model). X-FM has one language encoder, one vision encoder, and one fusion encoder, as well as a new training method. The training method includes two new techniques for learning X-FM from text, image, and image-text pair data. One is to stop gradients from the vision-language training when learning the language encoder. The other is to leverage the vision-language training to guide the learning of the vision encoder. Extensive experiments on benchmark datasets show that X-FM can significantly outperform existing general foundation models and perform better than or comparable to existing foundation models specifically for language, vision, or vision-language understanding. Code and pre-trained models are released at https://github.com/zhangxinsong-nlp/XFM.

arxiv情報

著者	Xinsong Zhang,Yan Zeng,Jipeng Zhang,Hang Li
発行日	2023-10-17 16:11:36+00:00
arxivサイト	arxiv_id(pdf)

提供元, 利用サービス

arxiv.jp, Google

Toward Building General Foundation Models for Language, Vision, and Vision-Language Understanding Tasks

要約

要約(オリジナル)

arxiv情報

提供元, 利用サービス

最近の投稿

最近のコメント

アーカイブ

カテゴリー