Visual question answering: from early developments to recent advances — a survey

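Summary

Visual Question Answering (VQA) is an evolving research field that aims to let machines answer questions about visual content by integrating image and language processing techniques such as feature extraction, object detection, text embedding, natural language understanding, and language generation.
With the growth of multimodal data research, VQA has attracted significant attention because of its broad range of applications, including interactive educational tools, medical image diagnosis, customer service, entertainment, and social media captioning.
VQA also plays an important role in assisting visually impaired people by generating descriptive content from images.
This survey introduces a taxonomy of VQA architectures, categorizing them by design choices and key components to make comparative analysis and evaluation easier.
We review the major VQA approaches with a focus on deep learning-based methods, and explore the emerging field of Large Visual Language Models (LVLMs), which have been successful on multimodal tasks such as VQA.
The paper further examines the available datasets and evaluation metrics that are essential for measuring VQA system performance, and then explores real-world VQA applications.
Finally, we highlight ongoing challenges and future directions in VQA research, presenting open questions and potential areas for further development.
This survey serves as a comprehensive resource for researchers and practitioners interested in the latest advances and the future.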

Summary (original)

Visual Question Answering (VQA) is an evolving research field aimed at enabling machines to answer questions about visual content by integrating image and language processing techniques such as feature extraction, object detection, text embedding, natural language understanding, and language generation. With the growth of multimodal data research, VQA has gained significant attention due to its broad applications, including interactive educational tools, medical image diagnosis, customer service, entertainment, and social media captioning. Additionally, VQA plays a vital role in assisting visually impaired individuals by generating descriptive content from images. This survey introduces a taxonomy of VQA architectures, categorizing them based on design choices and key components to facilitate comparative analysis and evaluation. We review major VQA approaches, focusing on deep learning-based methods, and explore the emerging field of Large Visual Language Models (LVLMs) that have demonstrated success in multimodal tasks like VQA. The paper further examines available datasets and evaluation metrics essential for measuring VQA system performance, followed by an exploration of real-world VQA applications. Finally, we highlight ongoing challenges and future directions in VQA research, presenting open questions and potential areas for further development. This survey serves as a comprehensive resource for researchers and practitioners interested in the latest advancements and future

arXiv information

Authors	Ngoc Dung Huynh, Mohamed Reda Bouadjenek, Sunil Aryal, Imran Razzak, Hakim Hacid
Published	2025-01-07 17:00:35+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
