Prompting Large Language Models with Answer Heuristics for Knowledge-based Visual Question Answering

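Summary

Knowledge-based visual question answering (VQA) requires external knowledge beyond the image to answer a question.
Early work retrieves the required knowledge from explicit knowledge bases (KBs), which often introduces information irrelevant to the question and limits model performance.
Recent work has tried to use a large language model (GPT-3) as an implicit knowledge engine for acquiring the knowledge needed to answer.
Although these methods achieve promising results, the authors argue that they do not fully activate GPT-3's capacity because the input information provided is insufficient.
This paper introduces Prophet, a conceptually simple framework designed to prompt GPT-3 with answer heuristics for knowledge-based VQA.
Specifically, a vanilla VQA model is first trained on a specific knowledge-based VQA dataset without external knowledge.
Two types of complementary answer heuristics are then extracted from the model: answer candidates and answer-aware examples.
Finally, both types of answer heuristics are encoded into the prompt so that GPT-3 can better understand the task, which enhances its capacity.
Prophet significantly outperforms all existing state-of-the-art methods on two challenging knowledge-based VQA datasets, OK-VQA and A-OKVQA, achieving 61.1% and 55.7% accuracy on their test sets, respectively.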

Abstract (original)

Knowledge-based visual question answering (VQA) requires external knowledge beyond the image to answer the question. Early studies retrieve required knowledge from explicit knowledge bases (KBs), which often introduces irrelevant information to the question, hence restricting the performance of their models. Recent works have sought to use a large language model (i.e., GPT-3) as an implicit knowledge engine to acquire the necessary knowledge for answering. Despite the encouraging results achieved by these methods, we argue that they have not fully activated the capacity of GPT-3 as the provided input information is insufficient. In this paper, we present Prophet — a conceptually simple framework designed to prompt GPT-3 with answer heuristics for knowledge-based VQA. Specifically, we first train a vanilla VQA model on a specific knowledge-based VQA dataset without external knowledge. After that, we extract two types of complementary answer heuristics from the model: answer candidates and answer-aware examples. Finally, the two types of answer heuristics are encoded into the prompts to enable GPT-3 to better comprehend the task thus enhancing its capacity. Prophet significantly outperforms all existing state-of-the-art methods on two challenging knowledge-based VQA datasets, OK-VQA and A-OKVQA, delivering 61.1% and 55.7% accuracies on their testing sets, respectively.
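
The last two steps of the abstract (extracting answer heuristics and encoding them into a prompt) can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the abstract does not say how many candidates are kept, how answer-aware examples are chosen, or what the prompt looks like. Here the candidates are the top-k answers by VQA-model confidence, the examples are the training items whose feature vectors have the highest cosine similarity to the query's, and the prompt layout, the function names, and the toy data are all hypothetical. The GPT-3 call itself is left out.

```python
import math
from typing import Dict, List, Optional, Sequence, Tuple


def top_k_candidates(scores: Dict[str, float], k: int) -> List[Tuple[str, float]]:
    """Answer candidates: the k answers with the highest confidence, best first."""
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:k]


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two feature vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def select_examples(query_feat: Sequence[float], pool: List[dict], n: int) -> List[dict]:
    """Answer-aware examples (assumed rule): the n pool items most similar to the query."""
    ranked = sorted(pool, key=lambda ex: cosine(query_feat, ex["feature"]), reverse=True)
    return ranked[:n]


def format_block(question: str, candidates: List[Tuple[str, float]],
                 answer: Optional[str] = None) -> str:
    """One prompt block; the answer slot is left empty for the query."""
    cand = ", ".join(f"{a} ({s:.2f})" for a, s in candidates)
    last = "Answer:" if answer is None else f"Answer: {answer}"
    return "\n".join([f"Question: {question}", f"Candidates: {cand}", last])


def build_prompt(question: str, scores: Dict[str, float], query_feat: Sequence[float],
                 pool: List[dict], k: int = 3, n: int = 2) -> str:
    """Encode both heuristics into a single text prompt (hypothetical layout)."""
    header = "Answer the question, using the candidates as hints."
    blocks = [
        format_block(ex["question"], top_k_candidates(ex["scores"], k), ex["answer"])
        for ex in select_examples(query_feat, pool, n)
    ]
    blocks.append(format_block(question, top_k_candidates(scores, k)))
    return header + "\n\n" + "\n\n".join(blocks)


# Toy stand-ins for the outputs of a trained vanilla VQA model (made-up values).
pool = [
    {"question": "What sport is this?", "answer": "tennis",
     "scores": {"tennis": 0.9, "golf": 0.05, "soccer": 0.05}, "feature": [1.0, 0.0]},
    {"question": "What fruit is shown?", "answer": "banana",
     "scores": {"banana": 0.8, "apple": 0.2}, "feature": [0.0, 1.0]},
    {"question": "What game uses this racket?", "answer": "badminton",
     "scores": {"badminton": 0.6, "tennis": 0.4}, "feature": [0.9, 0.1]},
]
query_scores = {"racket": 0.7, "bat": 0.2, "ball": 0.1}
prompt = build_prompt("What is the player holding?", query_scores, [1.0, 0.05], pool)
print(prompt)
```

The resulting string holds two in-context examples followed by the query, each with its candidate list, and would be sent to the language model as a completion prompt. Listing confidence scores next to the candidates is one possible design; whether they help depends on the model and is not something the abstract settles.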

arXiv information

Authors	Zhenwei Shao, Zhou Yu, Meng Wang, Jun Yu
Published	2023-03-16 01:49:29+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
