Black-box Model Ensembling for Textual and Visual Question Answering via Information Fusion

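Abstract

A wide range of large language models (LLMs), such as ChatGPT, and visual question answering (VQA) models, such as BLIP, have been developed to solve textual and visual question answering tasks.
However, fine-tuning these models is either difficult, because they are accessible only through APIs and are therefore effectively black boxes, or costly, because a large number of parameters must be tuned.
To address this, we introduce InfoSel, a data-efficient ensemble method that learns to dynamically select the winner among existing black-box models when making predictions for both textual and multimodal visual question answering tasks.
Unlike traditional ensemble models, InfoSel does not rely on prediction probabilities or confidence scores, which are typically unavailable for black-box models.
Experiments on four datasets show that, using only 1K training instances, our approach achieves an absolute F1-score gain of up to +5.19% over standalone LLMs.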

Abstract (original)

A diverse range of large language models (LLMs), e.g., ChatGPT, and visual question answering (VQA) models, e.g., BLIP, have been developed for solving textual and visual question answering tasks. However, fine-tuning these models is either difficult, as it requires access via APIs, rendering them as black-boxes, or costly due to the need of tuning a large number of parameters. To address this, we introduce InfoSel, a data-efficient ensemble method that learns to dynamically pick the winner from existing black-box models for predictions on both textual and multimodal visual question answering tasks. Unlike traditional ensemble models, InfoSel does not rely on prediction probabilities or confidences, which typically are not available in black-box models. Experimental results on four datasets demonstrate that our approach achieves an absolute increase of up to +5.19% in the F1-score compared to standalone LLMs using only 1K training instances.
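The abstract does not describe how InfoSel is built, so the following is only a toy illustration of the general idea it states: choosing one answer among several black-box models using nothing but the text of the question and the answers, with no prediction probabilities. Everything in it is an assumption made for illustration: the token-overlap F1 used to decide which model "won" on a training question, the Jaccard word-overlap similarity, and the hypothetical `NearestQuestionSelector`, which is a simple nearest-neighbour stand-in and not the authors' selection model.

```python
from collections import Counter


def token_f1(prediction, reference):
    """Token-overlap F1 between a predicted and a reference answer."""
    pred = prediction.lower().split()
    ref = reference.lower().split()
    # Multiset intersection counts each shared token at most as often as it
    # appears in both answers.
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


def best_model_index(answers, reference):
    """Index of the candidate answer with the highest F1 (first one on ties)."""
    scores = [token_f1(answer, reference) for answer in answers]
    return scores.index(max(scores))


def jaccard(a, b):
    """Word-set overlap between two questions (illustrative similarity only)."""
    set_a, set_b = set(a.lower().split()), set(b.lower().split())
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0


class NearestQuestionSelector:
    """Hypothetical stand-in selector: reuse the model that won on the most
    similar training question. It never sees model probabilities."""

    def __init__(self):
        self.memory = []  # list of (training question, index of winning model)

    def fit(self, questions, candidate_answers, references):
        self.memory = [
            (question, best_model_index(answers, reference))
            for question, answers, reference
            in zip(questions, candidate_answers, references)
        ]
        return self

    def select(self, question, answers):
        _, winner = max(self.memory, key=lambda item: jaccard(question, item[0]))
        return answers[winner]


# Made-up toy data: two black-box "models" answer each training question.
train_questions = ["what color is the sky", "who wrote hamlet"]
train_answers = [
    ["blue", "the sky is green"],
    ["charles dickens", "william shakespeare"],
]
train_references = ["blue", "william shakespeare"]

selector = NearestQuestionSelector().fit(
    train_questions, train_answers, train_references
)
picked = selector.select(
    "who wrote macbeth", ["jane austen", "william shakespeare"]
)
print(picked)
```

The selector only needs the answers each model returns, which is all an API-only model exposes. A learned selector would replace the nearest-neighbour lookup, but the inputs and the output (one chosen answer) would keep the same shape.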

arXiv information

Authors	Yuxi Xia, Klim Zaporojets, Benjamin Roth
Published	2024-12-17 13:31:18+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
