See, Think, Confirm: Interactive Prompting Between Vision and Language Models for Knowledge-based Visual Reasoning




Large pre-trained vision and language models have demonstrated remarkable capacities for various tasks. However, solving knowledge-based visual reasoning tasks remains challenging, because it requires a model to comprehensively understand image content, connect it to external world knowledge, and perform step-by-step reasoning to answer questions correctly. To this end, we propose a novel framework named Interactive Prompting Visual Reasoner (IPVR) for few-shot knowledge-based visual reasoning. IPVR contains three stages: see, think, and confirm. The see stage scans the image and grounds the visual concept candidates with a visual perception model. The think stage adopts a pre-trained large language model (LLM) to attend adaptively to the key concepts among the candidates. It then transforms them into text context for prompting with a visual captioning model and adopts the LLM to generate the answer. The confirm stage further uses the LLM to generate a supporting rationale for the answer, verifies the generated rationale with a cross-modality classifier, and ensures that the rationale can consistently infer the predicted output. We conduct experiments on a range of knowledge-based visual reasoning datasets. We find that IPVR has several benefits: 1) it achieves better performance than previous few-shot learning baselines; 2) it makes the whole reasoning process transparent and trustworthy by providing rationales for each reasoning step; 3) it is computationally efficient compared with fine-tuning baselines.
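The see, think, and confirm stages described in the abstract can be pictured as a small control loop. The following is only a minimal sketch under stated assumptions, not the authors' implementation: the `Models` container, the function `ipvr_sketch`, every callable signature, the bounded retry loop, and the `max_rounds` parameter are all hypothetical names introduced for illustration. The abstract does not specify how the stages interact beyond their order.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Models:
    """Hypothetical stand-ins for the components named in the abstract."""
    detect: Callable[[str], List[str]]            # perception: image -> concept candidates
    select: Callable[[str, List[str], str], str]  # LLM: question, candidates, context -> key concept
    caption: Callable[[str, str], str]            # captioner: image, concept -> text description
    answer: Callable[[str, str], str]             # LLM: question, context -> answer
    explain: Callable[[str, str, str], str]       # LLM: question, context, answer -> rationale
    matches_image: Callable[[str, str], bool]     # cross-modal check: image, rationale -> supported?


def ipvr_sketch(image: str, question: str, m: Models,
                max_rounds: int = 3) -> Tuple[str, str, bool]:
    """Return (answer, rationale, confirmed). The retry loop is an assumption."""
    # See: ground visual concept candidates in the image.
    candidates = m.detect(image)
    context, answer, rationale = "", "", ""
    for _ in range(max_rounds):
        # Think: pick a key concept, turn it into text context, then answer.
        concept = m.select(question, candidates, context)
        context = (context + " " + m.caption(image, concept)).strip()
        answer = m.answer(question, context)
        # Confirm: generate a rationale, check it against the image, and
        # check that the rationale alone leads back to the same answer.
        rationale = m.explain(question, context, answer)
        verified = m.matches_image(image, rationale)
        consistent = m.answer(question, rationale) == answer
        if verified and consistent:
            return answer, rationale, True
    return answer, rationale, False
```

Passing the models in as plain callables keeps the sketch independent of any particular perception model, captioner, or LLM API; in practice each callable would wrap a real model call and few-shot prompt.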


Authors: Zhenfang Chen, Qinhong Zhou, Yikang Shen, Yining Hong, Hao Zhang, Chuang Gan
Published: 2023-01-12 18:59:50+00:00

Categories: cs.AI, cs.CL, cs.CV, cs.LG