KL3M Tokenizers: A Family of Domain-Specific and Character-Level Tokenizers for Legal, Financial, and Preprocessing Applications

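Summary

We present the KL3M tokenizers, a family of specialized tokenizers for legal, financial, and governmental text.
Despite established work on tokenization, specialized tokenizers for professional domains remain understudied.
Our paper makes two main contributions to this area.
First, we introduce domain-specific BPE tokenizers for legal, financial, and governmental text.
The kl3m-004-128k-cased tokenizer uses 9-17% fewer tokens than GPT-4o and Llama3 on domain-specific documents, despite having a smaller vocabulary.
On specialized terminology, the cased tokenizer is more efficient still, using up to 83% fewer tokens for legal terms and 39% fewer tokens for financial terms.
Second, we develop character-level BPE tokenizers (4K, 8K, and 16K vocabulary sizes) for text correction tasks such as OCR post-processing.
These tokenizers keep token boundaries consistent between error-containing text and correct text, which makes it easier for models to learn correction patterns.
They help professional applications by fitting more text into context windows, reducing computational needs, and preserving the meaning of domain-specific terms.
Our analysis shows that these efficiency gains directly benefit the processing of long legal and financial documents.
We release all tokenizers and code through GitHub and Hugging Face to support further research in specialized tokenization.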

Abstract (original)

We present the KL3M tokenizers, a family of specialized tokenizers for legal, financial, and governmental text. Despite established work on tokenization, specialized tokenizers for professional domains remain understudied. Our paper offers two main contributions to this area. First, we introduce domain-specific BPE tokenizers for legal, financial, and governmental text. Our kl3m-004-128k-cased tokenizer uses 9-17% fewer tokens than GPT-4o and Llama3 for domain-specific documents, despite having a smaller vocabulary. For specialized terminology, our cased tokenizer is even more efficient, using up to 83% fewer tokens for legal terms and 39% fewer tokens for financial terms. Second, we develop character-level BPE tokenizers (4K, 8K, and 16K vocabulary sizes) for text correction tasks like OCR post-processing. These tokenizers keep consistent token boundaries between error-containing and correct text, making it easier for models to learn correction patterns. These tokenizers help professional applications by fitting more text in context windows, reducing computational needs, and preserving the meaning of domain-specific terms. Our analysis shows these efficiency gains directly benefit the processing of long legal and financial documents. We release all tokenizers and code through GitHub and Hugging Face to support further research in specialized tokenization.
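The two quantitative ideas in the abstract can be illustrated with a minimal Python sketch. It makes no claim to reproduce the authors' evaluation code: the function names are hypothetical, the token counts are made-up numbers, and `char_tokens` is a toy one-token-per-character stand-in, not the released KL3M character-level BPE tokenizers. The first function shows the arithmetic behind an "X% fewer tokens" figure; the second shows why aligned token boundaries localize an OCR substitution error to a single position.

```python
def percent_fewer_tokens(baseline_tokens: int, candidate_tokens: int) -> float:
    """Relative token savings of a candidate tokenizer versus a baseline, in percent."""
    if baseline_tokens <= 0:
        raise ValueError("baseline_tokens must be positive")
    return 100.0 * (baseline_tokens - candidate_tokens) / baseline_tokens


def char_tokens(text: str) -> list[str]:
    """Toy stand-in for a character-level tokenizer: one token per character."""
    return list(text)


def differing_positions(clean: str, noisy: str) -> list[int]:
    """Indices at which the token sequences of two equal-length strings disagree."""
    a, b = char_tokens(clean), char_tokens(noisy)
    if len(a) != len(b):
        raise ValueError("this sketch only handles substitution errors")
    return [i for i, (x, y) in enumerate(zip(a, b)) if x != y]


if __name__ == "__main__":
    # Hypothetical counts, not results from the paper.
    print(percent_fewer_tokens(1000, 870))
    # An OCR-style substitution ("o" read as "0") changes exactly one token.
    print(differing_positions("contract", "c0ntract"))
```

With a subword tokenizer that has a large vocabulary, the same one-character error can instead change how the whole word is split, so the clean and noisy sequences no longer line up token by token.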

arXiv information

Authors	Michael J Bommarito,Daniel Martin Katz,Jillian Bommarito
Published	2025-03-21 15:51:43+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
