Target Prompting for Information Extraction with Vision Language Model

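Abstract

Recent advances in large vision-language models (VLMs) have changed how information extraction systems are built.
VLMs have set new state-of-the-art benchmarks in document understanding and in building question-answering systems across a range of industries.
They are markedly better at generating text from document images and at answering questions accurately.
However, several challenges remain in using these models effectively to build accurate conversational systems.
General prompting techniques used with large language models are often unsuitable for these purpose-built vision-language models.
Output produced from such generic input prompts tends to be generic itself and, compared with the actual content of the document, may contain information gaps.
To obtain more accurate and specific answers, the vision-language model needs a well-targeted prompt along with the document image.
This paper describes a technique called Target prompting, which explicitly targets parts of a document image and generates relevant answers from those specific regions only.
It also evaluates the responses produced by each prompting technique across different user queries and input prompts.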
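
To make the contrast between a generic prompt and a targeted one concrete, here is a minimal sketch. It is not the paper's prompt format, which the abstract does not give. The function names, the wording of the prompts, and the example region description are all assumptions made for illustration, and the call to an actual vision-language model is left out.

```python
def generic_prompt(question: str) -> str:
    """Build a generic prompt that does not point at any part of the image."""
    return (
        "Answer the question using the attached document image.\n"
        f"Question: {question}"
    )


def target_prompt(question: str, region: str) -> str:
    """Build a prompt that names one region of the document image.

    `region` is a free-text description (hypothetical convention), such as
    "the totals table at the bottom of the page".
    """
    return (
        f"Look only at {region} in the attached document image.\n"
        "Answer the question using content from that region only.\n"
        "If the answer is not in that region, reply 'not found'.\n"
        f"Question: {question}"
    )


if __name__ == "__main__":
    question = "What is the total amount due?"
    print(generic_prompt(question))
    print()
    print(target_prompt(question, "the totals table at the bottom of the page"))
```

Either string would be sent to the model together with the document image. The targeted version differs only in that it restricts the model to a named region and gives it an explicit fallback when the region does not contain the answer.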

Abstract (original)

The recent trend in the Large Vision and Language model has brought a new change in how information extraction systems are built. VLMs have set a new benchmark with their State-of-the-art techniques in understanding documents and building question-answering systems across various industries. They are significantly better at generating text from document images and providing accurate answers to questions. However, there are still some challenges in effectively utilizing these models to build a precise conversational system. General prompting techniques used with large language models are often not suitable for these specially designed vision language models. The output generated by such generic input prompts is ordinary and may contain information gaps when compared with the actual content of the document. To obtain more accurate and specific answers, a well-targeted prompt is required by the vision language model, along with the document image. In this paper, a technique is discussed called Target prompting, which focuses on explicitly targeting parts of document images and generating related answers from those specific regions only. The paper also covers the evaluation of response for each prompting technique using different user queries and input prompts.

arXiv information

Author	Dipankar Medhi
Published	2024-08-07 15:17:51+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
