OpenViVQA: Task, Dataset, and Multimodal Fusion Models for Visual Question Answering in Vietnamese

Summary

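Title: OpenViVQA: Task, Dataset, and Multimodal Fusion Models for Visual Question Answering in Vietnamese

Summary:
– Visual question answering requires methods that can generate appropriate answers from an image and a question, and it has attracted attention because of its high potential for applications.
– Existing large-scale datasets have been skewed toward resource-rich languages such as English and limited to simple answer selection or answer classification tasks.
– This paper introduces OpenViVQA, the first large-scale dataset for visual question answering with open-ended answers in Vietnamese, proposes several multimodal fusion models (FST, QuMLAG, and MLPAG), and shows that these models achieve results comparable to competing existing models.
– These contributions are expected to encourage the development of more general algorithms that also cover low-resource languages such as Vietnamese.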

Abstract (original)

In recent years, visual question answering (VQA) has attracted attention from the research community because of its challenges and its high-potential applications (such as virtual assistance in intelligent cars, assistive devices for blind people, or information retrieval from document images using natural language as queries). The VQA task requires methods that can fuse the information from questions and images to produce appropriate answers. Neural visual question answering models have achieved tremendous growth on large-scale datasets, which are mostly for resource-rich languages such as English. However, available datasets narrow the VQA task to an answer selection task or an answer classification task. We argue that this form of VQA is far from human ability and eliminates the challenge of the answering aspect of the VQA task by just selecting answers rather than generating them. In this paper, we introduce the OpenViVQA (Open-domain Vietnamese Visual Question Answering) dataset, the first large-scale dataset for VQA with open-ended answers in Vietnamese, consisting of 11,000+ images associated with 37,000+ question-answer pairs (QAs). Moreover, we propose FST, QuMLAG, and MLPAG, which fuse information from images and answers, then use these fused features to construct answers iteratively, as humans do. Our proposed methods achieve results that are competitive with SOTA models such as SAAA, MCAN, LORA, and M4C. The dataset is available to encourage the research community to develop more generalized algorithms, including transformers, for low-resource languages such as Vietnamese.

arXiv information

Authors	Nghia Hieu Nguyen, Duong T. D. Vo, Kiet Van Nguyen, Ngan Luu-Thuy Nguyen
Published	2023-05-07 03:59:31+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, OpenAI
