An Annotated Dataset of Errors in Premodern Greek and Baselines for Detecting Them

Abstract

As premodern texts are passed down over centuries, errors inevitably accrue. These errors can be challenging to identify, as some have survived undetected for so long precisely because they are so elusive. While prior work has evaluated error detection methods on artificially generated errors, we introduce the first dataset of real errors in premodern Greek, enabling the evaluation of error detection methods on errors that genuinely accumulated at some stage in the centuries-long copying process. To create this dataset, we use metrics derived from BERT conditionals to sample 1,000 words more likely to contain errors, which are then annotated and labeled by a domain expert as errors or not. We then propose and evaluate new error detection methods and find that our discriminator-based detector outperforms all other methods, improving the true positive rate for classifying real errors by 5%. We additionally observe that scribal errors are more difficult to detect than print or digitization errors. Our dataset enables the evaluation of error detection methods on real errors in premodern texts for the first time, providing a benchmark for developing more effective error detection algorithms to assist scholars in restoring premodern works.
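
The selection and evaluation steps mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' method. It assumes each word already has a conditional probability from a masked language model, scores a word by its negative log probability, and takes the top k deterministically, whereas the abstract describes sampling. The function names, the placeholder word identifiers, and the probability values are all hypothetical.

```python
import math

def suspicion_scores(word_probs):
    """Score each word by -log p(word | context); higher means more suspicious."""
    return {word: -math.log(p) for word, p in word_probs.items()}

def most_suspicious(word_probs, k):
    """Return the k words with the highest suspicion score, most suspicious first."""
    scores = suspicion_scores(word_probs)
    return sorted(scores, key=scores.get, reverse=True)[:k]

def true_positive_rate(flagged, real_errors):
    """Fraction of expert-labeled errors that appear among the flagged words."""
    real_errors = set(real_errors)
    if not real_errors:
        return 0.0
    return len(set(flagged) & real_errors) / len(real_errors)

# Hypothetical conditional probabilities for five placeholder word positions.
example_probs = {"w1": 0.62, "w2": 0.48, "w3": 0.03, "w4": 0.20, "w5": 0.91}

# Hypothetical expert labels: the positions judged to be real errors.
expert_errors = {"w3", "w5"}

flagged = most_suspicious(example_probs, k=2)
tpr = true_positive_rate(flagged, expert_errors)
print(flagged, tpr)
```

In this toy setup a word that the model finds plausible in context (here `w5`) is never flagged, which mirrors the abstract's point that some errors survive because they are elusive.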

arXiv information

Authors	Creston Brooks, Johannes Haubold, Charlie Cowen-Breen, Jay White, Desmond DeVaul, Frederick Riemenschneider, Karthik Narasimhan, Barbara Graziosi
Published	2025-03-31 20:00:17+00:00
arXiv page	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google
