From Images to Textual Prompts: Zero-shot VQA with Frozen Large Language Models

Summary

Title: "From Images to Textual Prompts: Zero-shot VQA with Frozen Large Language Models"

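– Large language models (LLMs) have demonstrated excellent zero-shot generalization to new language tasks.
– However, using LLMs effectively for zero-shot visual question answering (VQA) remains difficult, mainly because of the modality disconnection and the task disconnection between the LLM and the VQA task.
– To address this problem, we propose Img2Prompt, a plug-and-play module.
– Img2Prompt supplies prompts that describe the image content, together with self-constructed question-answer pairs, so that an LLM can perform zero-shot VQA tasks.
– Img2Prompt offers the following benefits: 1) it works flexibly with different LLMs; 2) it requires no end-to-end training, which significantly reduces the cost of deploying LLMs; 3) it achieves performance comparable to or better than methods that rely on end-to-end training.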
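
The prompt assembly described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function name `build_vqa_prompt`, the `Contexts:` / `Question:` / `Answer:` layout, and the sample caption and question-answer pair are all assumptions made for this example. In the method itself, the captions and question-answer pairs would come from LLM-agnostic models rather than being written by hand.

```python
from typing import List, Tuple


def build_vqa_prompt(
    captions: List[str],
    qa_pairs: List[Tuple[str, str]],
    question: str,
) -> str:
    """Assemble a text-only prompt for a frozen LLM (hypothetical layout)."""
    # Captions carry the image content as text, addressing the modality gap.
    context = "Contexts: " + " ".join(c.strip() for c in captions)
    # Synthetic question-answer pairs serve as in-context demonstrations
    # of the VQA task, addressing the task gap.
    demos = [f"Question: {q} Answer: {a}" for q, a in qa_pairs]
    # The actual question is left open for the LLM to complete.
    query = f"Question: {question} Answer:"
    return "\n".join([context, *demos, query])


# Illustrative inputs only; not taken from the paper.
prompt = build_vqa_prompt(
    captions=["A man rides a horse on a beach."],
    qa_pairs=[("What animal is shown?", "horse")],
    question="Where is the man riding?",
)
print(prompt)
```

Because the result is plain text, it can be passed to any text-completion LLM without changing the model's weights, which is what lets the approach work with different LLMs.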

Abstract (original)

Large language models (LLMs) have demonstrated excellent zero-shot generalization to new language tasks. However, effective utilization of LLMs for zero-shot visual question-answering (VQA) remains challenging, primarily due to the modality disconnection and task disconnection between LLM and VQA task. End-to-end training on vision and language data may bridge the disconnections, but is inflexible and computationally expensive. To address this issue, we propose Img2Prompt, a plug-and-play module that provides the prompts that can bridge the aforementioned modality and task disconnections, so that LLMs can perform zero-shot VQA tasks without end-to-end training. In order to provide such prompts, we further employ LLM-agnostic models to provide prompts that can describe image content and self-constructed question-answer pairs, which can effectively guide LLM to perform zero-shot VQA tasks. Img2Prompt offers the following benefits: 1) It can flexibly work with various LLMs to perform VQA. 2) Without the needing of end-to-end training, it significantly reduces the cost of deploying LLM for zero-shot VQA tasks. 3) It achieves comparable or better performance than methods relying on end-to-end training. For example, we outperform Flamingo (DeepMind, 2022) by 5.6% on VQAv2. On the challenging A-OKVQA dataset, our method even outperforms few-shot methods by as much as 20%.

arXiv information

Authors	Jiaxian Guo, Junnan Li, Dongxu Li, Anthony Meng Huat Tiong, Boyang Li, Dacheng Tao, Steven C. H. Hoi
Published	2023-05-08 06:04:04+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, OpenAI
