HAUR: Human Annotation Understanding and Recognition Through Text-Heavy Images

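Abstract

Vision Question Answering (VQA) tasks use images to convey key information for answering text-based questions. This is one of the most common forms of question answering in real-world scenarios.
Many vision-text models exist today, and they perform well on certain VQA tasks.
However, these models are severely limited in their ability to understand human annotations on text-heavy images.
To address this, we propose the Human Annotation Understanding and Recognition (HAUR) task.
As part of this effort, we introduce the Human Annotation Understanding and Recognition-5 (HAUR-5) dataset, which covers five common types of human annotations.
In addition, we developed and trained a model, OCR-Mix.
Through comprehensive cross-model comparisons, our results show that OCR-Mix outperforms other models on this task.
Our dataset and model will be released soon.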

Abstract (original)

Vision Question Answering (VQA) tasks use images to convey critical information to answer text-based questions, which is one of the most common forms of question answering in real-world scenarios. Numerous vision-text models exist today and have performed well on certain VQA tasks. However, these models exhibit significant limitations in understanding human annotations on text-heavy images. To address this, we propose the Human Annotation Understanding and Recognition (HAUR) task. As part of this effort, we introduce the Human Annotation Understanding and Recognition-5 (HAUR-5) dataset, which encompasses five common types of human annotations. Additionally, we developed and trained our model, OCR-Mix. Through comprehensive cross-model comparisons, our results demonstrate that OCR-Mix outperforms other models in this task. Our dataset and model will be released soon.

arXiv information

Authors	Yuchen Yang, Haoran Yan, Yanhao Chen, Qingqiang Wu, Qingqi Hong
Published	2024-12-24 10:25:41+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
