Language Models over Canonical Byte-Pair Encodings

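Summary

Modern language models represent a probability distribution over character strings as a distribution over (shorter) token strings, derived via a deterministic tokenizer such as byte-pair encoding.
This approach is highly effective for scaling language models to large corpora, but its current incarnations have a concerning property: the model assigns nonzero probability mass to an exponential number of noncanonical token encodings of each character string.
These are token strings that decode to valid character strings but that the deterministic tokenizer can never produce, so they will never be seen in any training corpus, no matter how large.
This misallocation is both erroneous, since noncanonical strings never appear in training data, and wasteful, since it diverts probability mass away from plausible outputs.
These are avoidable mistakes!
This work proposes methods to enforce canonicality in token-level language models, ensuring that only canonical token strings are assigned positive probability.
Two approaches are presented: (1) canonicality by conditioning, which leverages test-time inference strategies without additional training, and (2) canonicality by construction, a model parameterization that guarantees canonical outputs but requires training.
The authors demonstrate that fixing canonicality mistakes improves the likelihood of held-out data for several models and corpora.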

Abstract (original)

Modern language models represent probability distributions over character strings as distributions over (shorter) token strings derived via a deterministic tokenizer, such as byte-pair encoding. While this approach is highly effective at scaling up language models to large corpora, its current incarnations have a concerning property: the model assigns nonzero probability mass to an exponential number of $\it{noncanonical}$ token encodings of each character string — these are token strings that decode to valid character strings but are impossible under the deterministic tokenizer (i.e., they will never be seen in any training corpus, no matter how large). This misallocation is both erroneous, as noncanonical strings never appear in training data, and wasteful, diverting probability mass away from plausible outputs. These are avoidable mistakes! In this work, we propose methods to enforce canonicality in token-level language models, ensuring that only canonical token strings are assigned positive probability. We present two approaches: (1) canonicality by conditioning, leveraging test-time inference strategies without additional training, and (2) canonicality by construction, a model parameterization that guarantees canonical outputs but requires training. We demonstrate that fixing canonicality mistakes improves the likelihood of held-out data for several models and corpora.
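To make the notion of a noncanonical encoding concrete, here is a minimal Python sketch. Everything in it is an assumption made for illustration and is not taken from the paper: the two-rule merge list `MERGES`, the vocabulary `VOCAB`, the probabilities in `TOY_PROBS`, and all function names are hypothetical. It enumerates every token string that decodes to a given character string, marks the single one the deterministic tokenizer would produce, and then renormalizes a toy distribution over the canonical token strings only. The last step illustrates the idea behind conditioning on canonicality by brute-force enumeration; it is not the test-time inference procedure the authors propose.

```python
# Hypothetical toy BPE: ordered merge rules and the vocabulary they induce.
MERGES = [("a", "b"), ("ab", "c")]
VOCAB = {"a", "b", "c", "ab", "abc"}


def bpe_encode(text):
    """Deterministic toy BPE: apply each merge rule, in priority order."""
    tokens = list(text)
    for left, right in MERGES:
        merged, i = [], 0
        while i < len(tokens):
            if i + 1 < len(tokens) and tokens[i] == left and tokens[i + 1] == right:
                merged.append(left + right)
                i += 2
            else:
                merged.append(tokens[i])
                i += 1
        tokens = merged
    return tuple(tokens)


def all_encodings(text):
    """Every token string over VOCAB whose concatenation equals `text`."""
    if not text:
        return [()]
    out = []
    for end in range(1, len(text) + 1):
        head = text[:end]
        if head in VOCAB:
            out.extend((head,) + rest for rest in all_encodings(text[end:]))
    return out


def is_canonical(tokens):
    """A token string is canonical iff the tokenizer maps its decoding back to it."""
    return bpe_encode("".join(tokens)) == tuple(tokens)


def condition_on_canonical(probs):
    """Zero out noncanonical token strings and renormalize the remaining mass."""
    kept = {toks: p for toks, p in probs.items() if is_canonical(toks)}
    total = sum(kept.values())
    return {toks: p / total for toks, p in kept.items()}


# Hypothetical token-level distribution that leaks mass onto noncanonical strings.
TOY_PROBS = {
    ("abc",): 0.50, ("ab", "c"): 0.10, ("a", "b", "c"): 0.05,
    ("ab",): 0.30, ("a", "b"): 0.05,
}

for encoding in all_encodings("abc"):
    print(encoding, "canonical" if is_canonical(encoding) else "noncanonical")
print(condition_on_canonical(TOY_PROBS))
```

In this toy setting the string "abc" has three encodings but only one canonical one, and 20% of the toy distribution's mass sits on noncanonical token strings. With a realistic vocabulary the number of encodings grows exponentially with string length, so enumeration like this is not feasible and serves only to show what is being counted.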

arXiv information

Authors	Tim Vieira, Tianyu Liu, Clemente Pasti, Yahya Emara, Brian DuSell, Benjamin LeBrun, Mario Giulianelli, Juan Luis Gastaldi, Timothy J. O’Donnell, Ryan Cotterell
Published	2025-06-09 17:26:14+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
