Guiding Vision-Language Model Selection for Visual Question-Answering Across Tasks, Domains, and Knowledge Types

Summary

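Visual question answering (VQA) has become central to user experience, particularly as the generalization capabilities of vision-language models (VLMs) have improved.
However, evaluating VLMs against an application's requirements with a standardized framework remains challenging in practical settings.
This paper aims to address that with an end-to-end framework.
We present VQA360, a new dataset derived from established VQA benchmarks and annotated with task types, application domains, and knowledge types for comprehensive evaluation.
We also introduce GoEval, a multimodal evaluation metric developed using GPT-4o, which achieves a correlation factor of 56.71% with human judgments.
Experiments with state-of-the-art VLMs show that no single model excels universally, which makes choosing the right model a key design decision.
Proprietary models such as Gemini-1.5-Pro and GPT-4o-mini generally outperform the others, but open-source models such as InternVL-2-8B and CogVLM-2-Llama-3-19B are also competitive and offer additional advantages.
Our framework can also be extended to other tasks.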
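
As a rough illustration of why per-tag evaluation matters for model selection, the sketch below averages scores per model within each value of one annotation axis and picks the top model for each. It is only a minimal sketch under stated assumptions: the record layout, the tag values, the model names (`model_a`, `model_b`), and all scores are hypothetical placeholders, not data or code from the paper.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical records. The three keys mirror the annotation axes named in
# the summary (task type, application domain, knowledge type); the tag
# values, model names, and scores are invented for illustration only.
RECORDS = [
    {"task_type": "chart", "domain": "finance", "knowledge_type": "mathematical",
     "scores": {"model_a": 0.9, "model_b": 0.6}},
    {"task_type": "chart", "domain": "science", "knowledge_type": "visual",
     "scores": {"model_a": 0.7, "model_b": 0.8}},
    {"task_type": "document", "domain": "finance", "knowledge_type": "visual",
     "scores": {"model_a": 0.4, "model_b": 0.8}},
    {"task_type": "document", "domain": "science", "knowledge_type": "mathematical",
     "scores": {"model_a": 0.5, "model_b": 0.6}},
]


def best_model_per_tag(records, axis):
    """Return {tag value: (best model, its mean score)} for one annotation axis."""
    # buckets[tag][model] collects every score that model got on that tag.
    buckets = defaultdict(lambda: defaultdict(list))
    for rec in records:
        for model, score in rec["scores"].items():
            buckets[rec[axis]][model].append(score)

    result = {}
    for tag, per_model in buckets.items():
        means = {model: mean(scores) for model, scores in per_model.items()}
        winner = max(means, key=means.get)
        result[tag] = (winner, means[winner])
    return result


if __name__ == "__main__":
    for axis in ("task_type", "domain", "knowledge_type"):
        print(axis, best_model_per_tag(RECORDS, axis))
```

With these placeholder numbers, different models come out on top for different task types, which is the kind of outcome that would make the choice of model depend on the application rather than on a single overall ranking.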

Summary (original)

Visual Question-Answering (VQA) has become key to user experience, particularly after improved generalization capabilities of Vision-Language Models (VLMs). But evaluating VLMs for an application requirement using a standardized framework in practical settings is still challenging. This paper aims to solve that using an end-to-end framework. We present VQA360 – a novel dataset derived from established VQA benchmarks, annotated with task types, application domains, and knowledge types, for a comprehensive evaluation. We also introduce GoEval, a multimodal evaluation metric developed using GPT-4o, achieving a correlation factor of 56.71% with human judgments. Our experiments with state-of-the-art VLMs reveal that no single model excels universally, thus, making a right choice a key design decision. Proprietary models such as Gemini-1.5-Pro and GPT-4o-mini generally outperform others, but open-source models like InternVL-2-8B and CogVLM-2-Llama-3-19B also demonstrate competitive strengths, while providing additional advantages. Our framework can also be extended to other tasks.

arXiv information

Authors	Neelabh Sinha, Vinija Jain, Aman Chadha
Published	2024-12-10 14:43:03+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
