II-MMR: Identifying and Improving Multi-modal Multi-hop Reasoning in Visual Question Answering

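Abstract

Visual Question Answering (VQA) often involves diverse reasoning scenarios across vision and language (V&L).
However, most prior VQA studies have focused only on a model's overall accuracy, without evaluating it on different reasoning cases.
Furthermore, some recent work observes that conventional Chain-of-Thought (CoT) prompting fails to generate effective reasoning for VQA, especially in complex scenarios that require multi-hop reasoning.
In this paper, we propose II-MMR, a novel idea for identifying and improving multi-modal multi-hop reasoning in VQA.
Specifically, II-MMR takes a VQA question with an image and finds a reasoning path to its answer using two novel language prompts: (i) an answer prediction-guided CoT prompt, or (ii) a knowledge triplet-guided prompt.
II-MMR then analyzes this path to identify the different reasoning cases in current VQA benchmarks by estimating how many hops and what type of reasoning (i.e., visual or beyond-visual) are required to answer the question.
On popular benchmarks including GQA and A-OKVQA, II-MMR observes that most VQA questions are easy to answer, demanding only "single-hop" reasoning, whereas only a few require "multi-hop" reasoning.
Moreover, while a recent V&L model struggles with such complex multi-hop questions even with the conventional CoT method, II-MMR is effective across all reasoning cases in both zero-shot and fine-tuning settings.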
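
The path analysis described in the abstract (estimating the number of hops and the reasoning type) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration and is not taken from the paper: the `Triplet` class, the `visual` flag, the `classify_path` helper, and the rule that one knowledge triplet counts as one hop are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Union


@dataclass(frozen=True)
class Triplet:
    """One step of a reasoning path (hypothetical representation)."""
    subject: str
    relation: str
    obj: str
    visual: bool  # assumed label: True if the fact can be read off the image


def classify_path(path: List[Triplet]) -> Dict[str, Union[int, str]]:
    """Summarize a reasoning path under the assumption: one triplet = one hop."""
    if not path:
        raise ValueError("a reasoning path needs at least one triplet")
    hops = len(path)
    return {
        "hops": hops,
        "case": "single-hop" if hops == 1 else "multi-hop",
        # The path is 'beyond-visual' as soon as any step needs outside knowledge.
        "type": "visual" if all(t.visual for t in path) else "beyond-visual",
    }


# Illustrative path for a question such as "What sport is the man playing?"
path = [
    Triplet("man", "holding", "racket", visual=True),
    Triplet("racket", "used for", "tennis", visual=False),
]
result = classify_path(path)
print(result)
```

In this toy path, the first step is grounded in the image and the second relies on outside knowledge, so the question would be counted as a multi-hop, beyond-visual case under the assumed rules.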

Abstract (original)

Visual Question Answering (VQA) often involves diverse reasoning scenarios across Vision and Language (V&L). Most prior VQA studies, however, have merely focused on assessing the model’s overall accuracy without evaluating it on different reasoning cases. Furthermore, some recent works observe that conventional Chain-of-Thought (CoT) prompting fails to generate effective reasoning for VQA, especially for complex scenarios requiring multi-hop reasoning. In this paper, we propose II-MMR, a novel idea to identify and improve multi-modal multi-hop reasoning in VQA. In specific, II-MMR takes a VQA question with an image and finds a reasoning path to reach its answer using two novel language promptings: (i) answer prediction-guided CoT prompt, or (ii) knowledge triplet-guided prompt. II-MMR then analyzes this path to identify different reasoning cases in current VQA benchmarks by estimating how many hops and what types (i.e., visual or beyond-visual) of reasoning are required to answer the question. On popular benchmarks including GQA and A-OKVQA, II-MMR observes that most of their VQA questions are easy to answer, simply demanding ‘single-hop’ reasoning, whereas only a few questions require ‘multi-hop’ reasoning. Moreover, while the recent V&L model struggles with such complex multi-hop reasoning questions even using the traditional CoT method, II-MMR shows its effectiveness across all reasoning cases in both zero-shot and fine-tuning settings.

arXiv information

Authors	Jihyung Kil, Farideh Tavazoee, Dongyeop Kang, Joo-Kyung Kim
Published	2024-05-31 17:30:13+00:00
arXiv page	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google
