Multimodal Question Answering for Unified Information Extraction

Abstract

Multimodal information extraction (MIE) aims to extract structured information from unstructured multimedia content. Because tasks and settings are so diverse, most current MIE models are task-specific and data-intensive, which limits their generalization to real-world scenarios with diverse task requirements and limited labeled data. To address these issues, we propose a novel multimodal question answering (MQA) framework that unifies three MIE tasks by reformulating them into a single span extraction and multi-choice QA pipeline. Extensive experiments on six datasets show that: 1) Our MQA framework consistently and significantly improves the performance of various off-the-shelf large multimodal models (LMMs) on MIE tasks compared to vanilla prompting. 2) In the zero-shot setting, MQA outperforms previous state-of-the-art baselines by a large margin. In addition, the effectiveness of our framework transfers to the few-shot setting, enabling LMMs at the 10B-parameter scale to be competitive with, or even outperform, much larger language models such as ChatGPT and GPT-4. Our MQA framework can serve as a general principle for utilizing LMMs to better solve MIE and potentially other downstream multimodal tasks.
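
The abstract names a two-stage pipeline (span extraction, then multi-choice QA) without giving its prompts, so the following is only a minimal sketch of what such a reformulation could look like. Everything in it is an assumption made for illustration: the prompt wording, the helper names (`span_question`, `choice_question`, `extract_entities`), the label set, and the `ask` callable that stands in for an LMM. A real LMM call would also receive the image, which is omitted here. Entity typing is used purely as an example task and is not claimed to be the authors' exact setup.

```python
import string
from dataclasses import dataclass
from typing import Callable, List, Optional

# Hypothetical stand-in for an LMM: takes a text prompt, returns a text answer.
# A real multimodal model would additionally receive the image.
AnswerFn = Callable[[str], str]


@dataclass
class TypedSpan:
    text: str
    label: str


def span_question(sentence: str) -> str:
    """Stage 1 prompt (assumed wording): ask the model for candidate spans."""
    return (
        f"Sentence: {sentence}\n"
        "Question: Which spans in the sentence are named entities? "
        "List them separated by semicolons, or answer 'none'."
    )


def choice_question(sentence: str, span: str, labels: List[str]) -> str:
    """Stage 2 prompt (assumed wording): classify one span via lettered options."""
    options = "\n".join(
        f"({string.ascii_uppercase[i]}) {label}" for i, label in enumerate(labels)
    )
    return (
        f"Sentence: {sentence}\n"
        f'Question: What type of entity is "{span}"?\n'
        f"{options}\n"
        "Answer with a single letter."
    )


def parse_spans(answer: str, sentence: str) -> List[str]:
    """Keep only answered spans that literally occur in the sentence."""
    if answer.strip().lower() == "none":
        return []
    spans = [part.strip() for part in answer.split(";")]
    return [span for span in spans if span and span in sentence]


def parse_choice(answer: str, labels: List[str]) -> Optional[str]:
    """Map an answer such as '(A)' or 'b' to a label; None if out of range."""
    cleaned = answer.strip().lstrip("(").strip()
    if not cleaned:
        return None
    index = string.ascii_uppercase.find(cleaned[0].upper())
    if 0 <= index < len(labels):
        return labels[index]
    return None


def extract_entities(sentence: str, labels: List[str], ask: AnswerFn) -> List[TypedSpan]:
    """Run both stages: extract spans, then type each span by multiple choice."""
    results = []
    for span in parse_spans(ask(span_question(sentence)), sentence):
        label = parse_choice(ask(choice_question(sentence, span, labels)), labels)
        if label is not None:
            results.append(TypedSpan(span, label))
    return results
```

Constraining the second stage to lettered options makes the model's answer easy to parse and lets malformed or out-of-range answers be discarded instead of guessed at.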

arXiv information

Authors	Yuxuan Sun, Kai Zhang, Yu Su
Published	2023-10-04 17:58:05+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
