Scaffold-BPE: Enhancing Byte Pair Encoding for Large Language Models with Simple and Effective Scaffold Token Removal

Summary

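Byte Pair Encoding (BPE) is a foundational method for text tokenization in natural language processing (NLP).
Despite its wide adoption, the original BPE algorithm has an inherent flaw: it inadvertently introduces a frequency imbalance among the tokens in the text corpus.
BPE repeatedly merges the most frequent token pair in the corpus to create a new token and keeps every generated token in the vocabulary, so it unavoidably retains tokens that serve mainly as components of longer tokens and rarely appear on their own.
We call such tokens scaffold tokens.
Because they occur infrequently in the text corpus, scaffold tokens cause a learning imbalance problem.
To address this problem, we propose Scaffold-BPE, which adds a dynamic scaffold token removal mechanism to the original BPE method through modifications that are parameter-free, computationally light, and easy to implement.
This approach excludes low-frequency scaffold tokens from the token representations of given texts, which mitigates the frequency imbalance and makes model training easier.
In extensive experiments across language modeling and machine translation, Scaffold-BPE consistently outperforms the original BPE, demonstrating its effectiveness.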
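
The frequency imbalance described above can be reproduced on a toy corpus. The following is a minimal sketch of plain BPE training only, not the authors' Scaffold-BPE implementation; the function names `train_bpe` and `standalone_frequencies` and the two-word corpus are hypothetical and chosen purely for illustration. It shows that an intermediate token created by an early merge can stay in the vocabulary while almost never appearing on its own in the final segmentation.

```python
from collections import Counter


def train_bpe(word_freqs, num_merges):
    """Plain BPE on a {word: frequency} dict (illustrative sketch only)."""
    # Every word starts as a tuple of single characters.
    corpus = {tuple(word): freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pair_counts = Counter()
        for symbols, freq in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        merged = best[0] + best[1]
        # Replace every occurrence of the best pair with the merged token.
        new_corpus = {}
        for symbols, freq in corpus.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(merged)
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            key = tuple(out)
            new_corpus[key] = new_corpus.get(key, 0) + freq
        corpus = new_corpus
    return merges, corpus


def standalone_frequencies(corpus):
    """Count how often each token appears in the final segmentation."""
    counts = Counter()
    for symbols, freq in corpus.items():
        for symbol in symbols:
            counts[symbol] += freq
    return counts


# Hypothetical toy corpus: "abc" is common, "abd" is rare.
merges, corpus = train_bpe({"abc": 10, "abd": 1}, num_merges=2)
freqs = standalone_frequencies(corpus)

print(merges)        # [('a', 'b'), ('ab', 'c')]
print(freqs["abc"])  # 10
print(freqs["ab"])   # 1: "ab" was merged first, but mostly lives inside "abc"
```

In this sketch, "ab" plays the role the summary attributes to a scaffold token: it is needed to build "abc", yet it appears on its own only once. How Scaffold-BPE detects and removes such tokens is not specified in the summary, so that step is deliberately left out here.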

Summary (original)

Byte Pair Encoding (BPE) serves as a foundation method for text tokenization in the Natural Language Processing (NLP) field. Despite its wide adoption, the original BPE algorithm harbors an inherent flaw: it inadvertently introduces a frequency imbalance for tokens in the text corpus. Since BPE iteratively merges the most frequent token pair in the text corpus to generate a new token and keeps all generated tokens in the vocabulary, it unavoidably holds tokens that primarily act as components of a longer token and appear infrequently on their own. We term such tokens as Scaffold Tokens. Due to their infrequent occurrences in the text corpus, Scaffold Tokens pose a learning imbalance issue. To address that issue, we propose Scaffold-BPE, which incorporates a dynamic scaffold token removal mechanism by parameter-free, computation-light, and easy-to-implement modifications to the original BPE method. This novel approach ensures the exclusion of low-frequency Scaffold Tokens from the token representations for given texts, thereby mitigating the issue of frequency imbalance and facilitating model training. On extensive experiments across language modeling and even machine translation, Scaffold-BPE consistently outperforms the original BPE, well demonstrating its effectiveness.

arXiv information

Authors	Haoran Lian, Yizhe Xiong, Jianwei Niu, Shasha Mo, Zhenpeng Su, Zijia Lin, Hui Chen, Peng Liu, Jungong Han, Guiguang Ding
Published	2024-11-13 08:51:04+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
