Vision-Language Model Based Handwriting Verification

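Summary

Handwriting verification is critical in document forensics.
Deep learning-based approaches often face skepticism from forensic document examiners because they lack explainability and depend on extensive training data and handcrafted features.
This paper explores how vision-language models (VLMs), such as OpenAI's GPT-4o and Google's PaliGemma, can be used to address these challenges.
By leveraging their visual question answering capabilities and zero-shot chain-of-thought (CoT) reasoning, our goal is to provide clear, human-understandable explanations of the model's decisions.
Our experiments on the CEDAR handwriting dataset demonstrate that VLMs enhance interpretability, reduce the need for large training datasets, and adapt better to diverse handwriting styles.
However, the results show that the CNN-based ResNet-18 architecture outperforms both the zero-shot CoT prompt-engineering approach with GPT-4o (accuracy: 70%) and supervised fine-tuned PaliGemma (accuracy: 71%), achieving 84% accuracy on the CEDAR AND dataset.
These findings highlight the potential of VLMs for generating human-interpretable decisions, while underscoring the need for further advances to match the performance of specialized deep learning models.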

Summary (original)

Handwriting verification is a critical task in document forensics. Deep learning-based approaches often face skepticism from forensic document examiners due to their lack of explainability and reliance on extensive training data and handcrafted features. This paper explores using Vision Language Models (VLMs), such as OpenAI’s GPT-4o and Google’s PaliGemma, to address these challenges. By leveraging their Visual Question Answering capabilities and 0-shot Chain-of-Thought (CoT) reasoning, our goal is to provide clear, human-understandable explanations for model decisions. Our experiments on the CEDAR handwriting dataset demonstrate that VLMs offer enhanced interpretability, reduce the need for large training datasets, and adapt better to diverse handwriting styles. However, results show that the CNN-based ResNet-18 architecture outperforms the 0-shot CoT prompt engineering approach with GPT-4o (Accuracy: 70%) and supervised fine-tuned PaliGemma (Accuracy: 71%), achieving an accuracy of 84% on the CEDAR AND dataset. These findings highlight the potential of VLMs in generating human-interpretable decisions while underscoring the need for further advancements to match the performance of specialized deep learning models.
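The abstract describes a zero-shot CoT setup in which a VLM answers a visual question about a pair of handwriting samples and explains its verdict. Below is a minimal Python sketch of the text side of such a pipeline, under stated assumptions. The prompt wording, the `Answer: SAME` / `Answer: DIFFERENT` output convention, and the helper names `build_verification_prompt`, `parse_verdict`, and `accuracy` are all hypothetical illustrations, not the authors' prompt or code. The call that sends the two images to GPT-4o or PaliGemma is deliberately omitted. The phrase "Let's think step by step." is the commonly used zero-shot CoT trigger.

```python
import re
from typing import List, Optional

# Hypothetical prompt wording for illustration only; NOT the paper's prompt.
COT_TRIGGER = "Let's think step by step."


def build_verification_prompt() -> str:
    """Return a zero-shot CoT question for a pair of handwriting images.

    Assumption: the two images (e.g. two handwritten samples of "AND") are
    attached separately as image inputs to the VLM; that call is not shown.
    """
    return (
        "You are a forensic document examiner. Each of the two attached images "
        "contains a handwritten sample. Compare features such as letter shapes, "
        "slant, stroke connections, spacing, and pen pressure. Explain your "
        "reasoning, then finish with a final line of the form "
        "'Answer: SAME' or 'Answer: DIFFERENT'. " + COT_TRIGGER
    )


# Matches the hypothetical final-answer convention requested in the prompt.
_ANSWER_RE = re.compile(r"Answer:\s*(SAME|DIFFERENT)\b", re.IGNORECASE)


def parse_verdict(response: str) -> Optional[bool]:
    """Map a free-text reply to True (same writer), False (different), or None."""
    matches = _ANSWER_RE.findall(response)
    if not matches:
        return None  # the model did not follow the answer format
    # Take the last "Answer:" line in case the reasoning also contains one.
    return matches[-1].upper() == "SAME"


def accuracy(predictions: List[Optional[bool]], labels: List[bool]) -> float:
    """Fraction of pairs verified correctly; unparseable replies count as errors."""
    if not labels or len(predictions) != len(labels):
        raise ValueError("predictions and labels must be non-empty and equal length")
    correct = sum(p is not None and p == y for p, y in zip(predictions, labels))
    return correct / len(labels)


if __name__ == "__main__":
    # Made-up replies for demonstration; these are NOT real model outputs.
    replies = [
        "The slant and the loop of the 'A' match closely.\nAnswer: SAME",
        "The stroke order of the 'd' differs.\nAnswer: DIFFERENT",
        "I cannot determine this.",
    ]
    labels = [True, True, False]  # ground truth: same writer?
    preds = [parse_verdict(r) for r in replies]
    print(preds)  # → [True, False, None]
    print(round(accuracy(preds, labels), 3))  # → 0.333
```

Counting unparseable replies as errors is a conservative choice for this sketch; an evaluation could instead report format failures separately from wrong verdicts.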

arXiv information

Authors	Mihir Chauhan, Abhishek Satbhai, Mohammad Abuzar Hashemi, Mir Basheer Ali, Bina Ramamurthy, Mingchen Gao, Siwei Lyu, Sargur Srihari
Published	2024-07-31 17:57:32+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
