Sub-Character Tokenization for Chinese Pretrained Language Models

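Abstract

Tokenization is fundamental to pretrained language models (PLMs).
Existing tokenization methods for Chinese PLMs typically treat each character as an indivisible token.
However, these methods ignore a distinctive feature of the Chinese writing system: additional linguistic information exists below the character level, that is, at the sub-character level.
To exploit this information, we propose sub-character (SubChar for short) tokenization.
Specifically, we first encode the input text by converting each Chinese character into a short sequence based on its glyph or pronunciation, and then construct the vocabulary from the encoded text using subword segmentation.
Experimental results show that SubChar tokenizers have two main advantages over existing tokenizers. 1) They tokenize inputs into much shorter sequences, which improves computational efficiency.
2) Pronunciation-based SubChar tokenizers encode Chinese homophones into the same transliteration sequence and produce the same tokenization output, which makes them robust to homophone typos.
At the same time, models trained with SubChar tokenizers perform competitively on downstream tasks.
To facilitate future work, we release our code and models at https://github.com/thunlp/SubCharTokenization.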
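
The encode-then-segment procedure described in the abstract can be illustrated with a minimal Python sketch. Everything named here is a hypothetical stand-in and not the authors' implementation: `TOY_PINYIN` is a five-character toy table, the pinyin is written without tones, `#` is an assumed separator marking character boundaries, and a greedy longest-match over a hand-written `TOY_VOCAB` replaces a trained subword segmenter. The released repository should be consulted for the actual encoders and vocabularies.

```python
# Toy pronunciation table (assumption: toneless pinyin, five characters only).
TOY_PINYIN = {
    "他": "ta",
    "她": "ta",
    "它": "ta",
    "们": "men",
    "好": "hao",
}

SEP = "#"  # assumed separator appended after each encoded character

# Hand-written vocabulary standing in for one learned by subword segmentation.
TOY_VOCAB = {"ta#men#", "ta#", "men#", "hao#"}


def encode_pronunciation(text):
    """Replace each character found in TOY_PINYIN by its pinyin plus SEP.

    Characters that are not in the table are passed through unchanged.
    """
    return "".join(
        TOY_PINYIN[ch] + SEP if ch in TOY_PINYIN else ch for ch in text
    )


def tokenize(encoded, vocab=TOY_VOCAB):
    """Greedy longest-match segmentation of an encoded string.

    At each position the longest vocabulary entry is taken; if none
    matches, a single character is emitted as a fallback token.
    """
    tokens = []
    i = 0
    while i < len(encoded):
        for j in range(len(encoded), i, -1):
            if encoded[i:j] in vocab or j == i + 1:
                tokens.append(encoded[i:j])
                i = j
                break
    return tokens


correct = tokenize(encode_pronunciation("他们好"))
typo = tokenize(encode_pronunciation("她们好"))  # homophone substituted for 他
print(correct)
print(correct == typo)
```

In this toy setting the three-character input becomes two tokens, and the homophone variant yields an identical token sequence, which mirrors the two advantages claimed for pronunciation-based SubChar tokenizers. How large these effects are in practice depends on the real encoding scheme and the learned vocabulary.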

Abstract (original)

Tokenization is fundamental to pretrained language models (PLMs). Existing tokenization methods for Chinese PLMs typically treat each character as an indivisible token. However, they ignore the unique feature of the Chinese writing system where additional linguistic information exists below the character level, i.e., at the sub-character level. To utilize such information, we propose sub-character (SubChar for short) tokenization. Specifically, we first encode the input text by converting each Chinese character into a short sequence based on its glyph or pronunciation, and then construct the vocabulary based on the encoded text with sub-word segmentation. Experimental results show that SubChar tokenizers have two main advantages over existing tokenizers: 1) They can tokenize inputs into much shorter sequences, thus improving the computational efficiency. 2) Pronunciation-based SubChar tokenizers can encode Chinese homophones into the same transliteration sequences and produce the same tokenization output, hence being robust to homophone typos. At the same time, models trained with SubChar tokenizers perform competitively on downstream tasks. We release our code and models at https://github.com/thunlp/SubCharTokenization to facilitate future work.

arXiv information

Authors	Chenglei Si, Zhengyan Zhang, Yingfa Chen, Fanchao Qi, Xiaozhi Wang, Zhiyuan Liu, Yasheng Wang, Qun Liu, Maosong Sun
Published	2023-02-14 21:07:45+00:00

Source and services used

arxiv.jp, Google
