A Comprehensive Survey of Knowledge-Based Vision Question Answering Systems: The Lifecycle of Knowledge in Visual Reasoning Task

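Abstract

Knowledge-based Vision Question Answering (KB-VQA) extends general Vision Question Answering (VQA): it requires not only an understanding of visual and textual inputs but also a broad range of knowledge, enabling significant advances across a variety of real-world applications.
KB-VQA introduces unique challenges, including aligning heterogeneous information from diverse modalities and sources, retrieving relevant knowledge from noisy or large-scale repositories, and performing complex reasoning to infer answers from the combined context.
With the advancement of Large Language Models (LLMs), KB-VQA systems have also undergone a notable transformation, with LLMs serving as powerful knowledge repositories, retrieval-augmented generators, and strong reasoners.
Despite substantial progress, no comprehensive survey currently organizes and reviews existing KB-VQA methods systematically.
This survey aims to fill that gap by establishing a structured taxonomy of KB-VQA approaches and categorizing systems into three main stages: knowledge representation, knowledge retrieval, and knowledge reasoning.
By exploring various knowledge integration techniques and identifying persistent challenges, this work also outlines promising future research directions and provides a foundation for advancing KB-VQA models and their applications.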

Abstract (original)

Knowledge-based Vision Question Answering (KB-VQA) extends general Vision Question Answering (VQA) by not only requiring the understanding of visual and textual inputs but also an extensive range of knowledge, enabling significant advancements across various real-world applications. KB-VQA introduces unique challenges, including the alignment of heterogeneous information from diverse modalities and sources, the retrieval of relevant knowledge from noisy or large-scale repositories, and the execution of complex reasoning to infer answers from the combined context. With the advancement of Large Language Models (LLMs), KB-VQA systems have also undergone a notable transformation, where LLMs serve as powerful knowledge repositories, retrieval-augmented generators and strong reasoners. Despite substantial progress, no comprehensive survey currently exists that systematically organizes and reviews the existing KB-VQA methods. This survey aims to fill this gap by establishing a structured taxonomy of KB-VQA approaches and categorizing the systems into main stages: knowledge representation, knowledge retrieval, and knowledge reasoning. By exploring various knowledge integration techniques and identifying persistent challenges, this work also outlines promising future research directions, providing a foundation for advancing KB-VQA models and their applications.

arXiv information

Authors	Jiaqi Deng,Zonghan Wu,Huan Huo,Guandong Xu
Published	2025-04-24 13:37:25+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
