Historical German Text Normalization Using Type- and Token-Based Language Modeling

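Summary

Historical spelling variation poses a challenge for full-text search and natural language processing on digitized historical texts.
To narrow the gap between historical orthography and contemporary spelling, the historical source material is usually normalized orthographically by automatic means.
This report proposes a normalization system for German literary texts from c. 1700-1900, trained on a parallel corpus.
The proposed system takes a machine learning approach based on Transformer language models: an encoder-decoder model normalizes individual word types, and a pre-trained causal language model adjusts these normalizations within their context.
An extensive evaluation shows that the proposed system achieves state-of-the-art accuracy, comparable to that of a much larger, fully end-to-end sentence-based normalization system that fine-tunes a pre-trained Transformer large language model.
However, the normalization of historical text remains a challenge, because models have difficulty generalizing and extensive high-quality parallel data is lacking.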

Abstract (original)

Historic variations of spelling poses a challenge for full-text search or natural language processing on historical digitized texts. To minimize the gap between the historic orthography and contemporary spelling, usually an automatic orthographic normalization of the historical source material is pursued. This report proposes a normalization system for German literary texts from c. 1700-1900, trained on a parallel corpus. The proposed system makes use of a machine learning approach using Transformer language models, combining an encoder-decoder model to normalize individual word types, and a pre-trained causal language model to adjust these normalizations within their context. An extensive evaluation shows that the proposed system provides state-of-the-art accuracy, comparable with a much larger fully end-to-end sentence-based normalization system, fine-tuning a pre-trained Transformer large language model. However, the normalization of historical text remains a challenge due to difficulties for models to generalize, and the lack of extensive high-quality parallel data.
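
The two-stage idea in the abstract (type-level candidates first, context-level adjustment second) can be illustrated with a toy sketch. Everything here is an assumption made for illustration: the candidate table stands in for the encoder-decoder model, the bigram counts stand in for the pre-trained causal language model, and the names `TYPE_CANDIDATES`, `BIGRAM_COUNTS`, `score`, and `normalize` are hypothetical. It is not the system described in the report.

```python
from itertools import product
from math import log

# Hypothetical type-level output: each historical word type maps to one or
# more candidate modern spellings. In the described system an encoder-decoder
# model would propose these; here it is a hand-written table.
TYPE_CANDIDATES = {
    "seyn": ["sein"],
    "thun": ["tun"],
    "wider": ["wider", "wieder"],  # ambiguous: needs context to decide
}

# Toy bigram counts standing in for a pre-trained causal language model.
BIGRAM_COUNTS = {
    ("kommt", "wieder"): 5,
    ("wieder", "nach"): 4,
    ("wider", "den"): 3,
}


def score(tokens):
    """Add-one smoothed log score of a token sequence under the toy bigram table."""
    return sum(log(BIGRAM_COUNTS.get(pair, 0) + 1) for pair in zip(tokens, tokens[1:]))


def normalize(tokens):
    """Pick, among all combinations of type-level candidates, the best-scoring sentence."""
    # Tokens without an entry are kept unchanged.
    options = [TYPE_CANDIDATES.get(tok, [tok]) for tok in tokens]
    best = max(product(*options), key=score)
    return list(best)


if __name__ == "__main__":
    print(normalize(["er", "kommt", "wider", "nach", "Hause"]))
    print(normalize(["wider", "den", "Feind"]))
```

Enumerating every combination is only workable for short toy inputs; a real system would need a search strategy such as beam search, and the report itself does not specify one in the abstract.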

arXiv information

Author	Anton Ehrmanntraut
Published	2025-02-25 17:24:16+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
