Foundation Model is Efficient Multimodal Multitask Model Selector

要約

この論文では、十分に調査されていないが重要な問題を調査します。それは、事前にトレーニングされたニューラルネットワークのコレクションが与えられ、画像認識、参照、キャプション、視覚的な質問応答など、微調整することなく各マルチモーダルタスクのパフォーマンスを予測するというものです。
テキストの質問応答。
ブルートフォース手法では、すべてのターゲットデータセットですべてのモデルを微調整するため、高い計算コストがかかります。
最近の高度なアプローチでは、モデルの伝達性を測定するために軽量のメトリクスが採用されていますが、多くの場合、単一タスクの事前知識に大きく依存しているため、マルチモーダルマルチタスクシナリオには適用できません。
この問題に取り組むために、私たちは効率的なマルチタスクモデルセレクター (EMMS) を提案します。これは、大規模な基盤モデルを使用して、カテゴリ、テキスト、さまざまな下流タスクの境界ボックスなどの多様なラベル形式を、統合されたノイズの多いラベル埋め込みに変換します。
EMMS は、単純な重み付き線形回帰を通じてモデルの伝達可能性を推定できます。これは、収束保証を備えた交互最小化アルゴリズムによって効率的に解決できます。
24 個のデータセットを使用した 5 つの下流タスクに関する広範な実験により、EMMS は事前トレーニング済みモデルの移行可能性を評価するのに十分な高速性、効果性、汎用性を備えていることが示され、マルチタスクシナリオにおける最初のモデル選択方法となります。
たとえば、ラベル埋め込みによって強化された最先端の手法 LogME と比較して、EMMS は画像認識で 9.0\%、26.3\%、20.1\%、54.8\%、12.2\% のパフォーマンス向上を達成しています。以下を参照します。
キャプション、ビジュアルな質問応答、およびテキストによる質問応答が可能になり、実測時間でそれぞれ 5.13 倍、6.29 倍、3.59 倍、6.19 倍、および 5.66 倍のスピードアップをもたらします。
コードは https://github.com/OpenGVLab/Multitask-Model-Selector で入手できます。

要約(オリジナル)

This paper investigates an under-explored but important problem: given a collection of pre-trained neural networks, predicting their performance on each multi-modal task without fine-tuning them, such as image recognition, referring, captioning, visual question answering, and text question answering. A brute-force approach is to finetune all models on all target datasets, bringing high computational costs. Although recent-advanced approaches employed lightweight metrics to measure models’ transferability,they often depend heavily on the prior knowledge of a single task, making them inapplicable in a multi-modal multi-task scenario. To tackle this issue, we propose an efficient multi-task model selector (EMMS), which employs large-scale foundation models to transform diverse label formats such as categories, texts, and bounding boxes of different downstream tasks into a unified noisy label embedding. EMMS can estimate a model’s transferability through a simple weighted linear regression, which can be efficiently solved by an alternating minimization algorithm with a convergence guarantee. Extensive experiments on 5 downstream tasks with 24 datasets show that EMMS is fast, effective, and generic enough to assess the transferability of pre-trained models, making it the first model selection method in the multi-task scenario. For instance, compared with the state-of-the-art method LogME enhanced by our label embeddings, EMMS achieves 9.0\%, 26.3\%, 20.1\%, 54.8\%, 12.2\% performance gain on image recognition, referring, captioning, visual question answering, and text question answering, while bringing 5.13x, 6.29x, 3.59x, 6.19x, and 5.66x speedup in wall-clock time, respectively. The code is available at https://github.com/OpenGVLab/Multitask-Model-Selector.

arxiv情報

著者	Fanqing Meng,Wenqi Shao,Zhanglin Peng,Chonghe Jiang,Kaipeng Zhang,Yu Qiao,Ping Luo
発行日	2023-08-11 17:54:44+00:00
arxivサイト	arxiv_id(pdf)

提供元, 利用サービス

arxiv.jp, Google

Foundation Model is Efficient Multimodal Multitask Model Selector

要約

要約(オリジナル)

arxiv情報

提供元, 利用サービス

最近の投稿

最近のコメント

アーカイブ

カテゴリー