Fine-Grained Retrieval-Augmented Generation for Visual Question Answering

Abstract

Visual Question Answering (VQA) focuses on providing answers to natural language questions by utilizing information from images. Although cutting-edge multimodal large language models (MLLMs) such as GPT-4o achieve strong performance on VQA tasks, they frequently fall short in accessing domain-specific or the latest knowledge. To mitigate this issue, retrieval-augmented generation (RAG) leveraging external knowledge bases (KBs), referred to as KB-VQA, emerges as a promising approach. Nevertheless, conventional unimodal retrieval techniques, which translate images into textual descriptions, often result in the loss of critical visual details. This study presents fine-grained knowledge units, which merge textual snippets with entity images stored in vector databases. Furthermore, we introduce a knowledge unit retrieval-augmented generation framework (KU-RAG) that integrates fine-grained retrieval with MLLMs. The proposed KU-RAG framework ensures precise retrieval of relevant knowledge and enhances reasoning capabilities through a knowledge correction chain. Experimental findings demonstrate that our approach significantly boosts the performance of leading KB-VQA methods, achieving an average improvement of approximately 3% and up to 11% in the best case.
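
The abstract describes knowledge units that pair a text snippet with an entity image and are retrieved from a vector database. The following is a minimal illustrative sketch of that idea, not the authors' implementation: the `KnowledgeUnit` class, the `retrieve` function, and the three-dimensional toy embeddings are all hypothetical, and a plain in-memory list with cosine similarity stands in for a real vector database and image encoder.

```python
import math
from dataclasses import dataclass
from typing import List


@dataclass
class KnowledgeUnit:
    """Hypothetical unit: one entity, its text snippet, and an image embedding."""
    entity: str
    text_snippet: str
    image_embedding: List[float]  # toy vector; a real system would use an image encoder


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query_embedding: List[float], units: List[KnowledgeUnit], k: int = 1) -> List[KnowledgeUnit]:
    """Return the k units whose image embeddings are most similar to the query."""
    ranked = sorted(
        units,
        key=lambda u: cosine(query_embedding, u.image_embedding),
        reverse=True,
    )
    return ranked[:k]


# Toy knowledge base with made-up embeddings (assumption, for illustration only).
units = [
    KnowledgeUnit("Eiffel Tower", "Wrought-iron lattice tower in Paris.", [0.9, 0.1, 0.0]),
    KnowledgeUnit("Golden Gate Bridge", "Suspension bridge in San Francisco.", [0.1, 0.9, 0.2]),
]

# Embedding of a query image region (also made up).
top = retrieve([0.8, 0.2, 0.1], units, k=1)
print(top[0].entity, "-", top[0].text_snippet)
```

In a pipeline of the kind the abstract outlines, the retrieved snippet would then be passed to the MLLM together with the question and image; how KU-RAG performs that step and its knowledge correction chain is not specified here.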

arXiv information

Authors	Zhengxuan Zhang, Yin Wu, Yuyu Luo, Nan Tang
Published	2025-04-11 16:02:25+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
