Look, Read and Ask: Learning to Ask Questions by Reading Text in Images

Summary

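We introduce a new problem: text-based visual question generation, or TextVQG.
TextVQG is an important task given the document image analysis community's growing interest in combining text understanding with conversational artificial intelligence, for example in text-based visual question answering.
Given an input image and a piece of text automatically extracted from it, known as an OCR token, TextVQG aims to generate a natural language question whose answer is that OCR token.
TextVQG is an essential capability for a conversational agent.
It is challenging, however, because it requires an in-depth understanding of the scene and the ability to semantically bridge the visual content and the text in the image.
To address TextVQG, we present an OCR-consistent visual question generation model that looks into the visual content, reads the scene text, and asks a relevant and meaningful natural language question.
We refer to the proposed model as OLRA.
We evaluate OLRA extensively on two public benchmarks and compare it against baselines.
OLRA automatically generates questions similar to those in manually curated public text-based visual question answering datasets.
Moreover, it significantly outperforms the baseline approaches on performance measures commonly used in the text generation literature.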
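
The abstract does not name the text generation measures used in the evaluation. As an illustration only, the sketch below assumes an n-gram overlap measure in the spirit of BLEU and computes clipped n-gram precision between a generated question and a reference question. The function names and the example question pair are hypothetical and are not taken from the paper or its datasets.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) in a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def clipped_precision(candidate, reference, n=1):
    """Clipped n-gram precision of a candidate question against one reference.

    Each candidate n-gram is credited at most as many times as it occurs in
    the reference, which is the counting rule used inside BLEU. This is an
    assumed, simplified measure, not the paper's evaluation code.
    """
    cand_counts = Counter(ngrams(candidate.lower().split(), n))
    ref_counts = Counter(ngrams(reference.lower().split(), n))
    total = sum(cand_counts.values())
    if total == 0:
        return 0.0
    matched = sum(min(count, ref_counts[gram])
                  for gram, count in cand_counts.items())
    return matched / total


# Hypothetical example pair (not from the paper or its benchmarks).
generated = "what is the name of the shop"
reference = "what is the name of the store"

print(clipped_precision(generated, reference, n=1))
print(clipped_precision(generated, reference, n=2))
```

In this example six of the seven generated unigrams and five of the six generated bigrams also occur in the reference, so the two scores are 6/7 and 5/6. A full BLEU score would additionally combine several n-gram orders and apply a brevity penalty, which this sketch leaves out.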

Summary (original)

We present a novel problem of text-based visual question generation or TextVQG in short. Given the recent growing interest of the document image analysis community in combining text understanding with conversational artificial intelligence, e.g., text-based visual question answering, TextVQG becomes an important task. TextVQG aims to generate a natural language question for a given input image and an automatically extracted text also known as OCR token from it such that the OCR token is an answer to the generated question. TextVQG is an essential ability for a conversational agent. However, it is challenging as it requires an in-depth understanding of the scene and the ability to semantically bridge the visual content with the text present in the image. To address TextVQG, we present an OCR consistent visual question generation model that Looks into the visual content, Reads the scene text, and Asks a relevant and meaningful natural language question. We refer to our proposed model as OLRA. We perform an extensive evaluation of OLRA on two public benchmarks and compare them against baselines. Our model OLRA automatically generates questions similar to the public text-based visual question answering datasets that were curated manually. Moreover, we significantly outperform baseline approaches on the performance measures popularly used in text generation literature.

arXiv information

Authors	Soumya Jahagirdar, Shankar Gangisetty, Anand Mishra
Published	2022-11-23 13:52:46+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
