Enhancing Character-Level Understanding in LLMs through Token Internal Structure Learning

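Summary

Tokenization techniques such as Byte-Pair Encoding (BPE) and Byte-Level BPE (BBPE) split text into tokens, which has significantly improved the computational efficiency and vocabulary representation stability of large language models (LLMs).
However, this segmentation often obscures the character structure and character order inside each token, so models cannot fully learn these details during training.
As a result, LLMs struggle to understand the character composition and positional relationships within tokens, especially when they are fine-tuned on downstream tasks with limited data.
This paper introduces Token Internal Position Awareness (TIPA), a new approach that strengthens an LLM's understanding of internal token structure by training it on reverse character prediction tasks built from the tokenizer's own vocabulary.
With this method, the model can effectively learn and generalize character positions and internal structure.
Experimental results show that LLMs trained with TIPA outperform baseline models at predicting character positions at the token level.
Furthermore, when applied to the downstream task of Chinese Spelling Correction (CSC), TIPA not only accelerates model convergence but also significantly improves task performance.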
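
The summary does not spell out what a reverse character prediction example looks like, so the following is only a minimal sketch under stated assumptions. It assumes each training target lists a token's characters from the last position to the first, keyed by 1-based position and serialized as JSON. The function names `reverse_char_positions` and `build_examples`, the `prompt`/`target` field names, and the toy vocabulary are all hypothetical and are not taken from the paper.

```python
import json
from typing import Dict, List


def reverse_char_positions(token: str) -> Dict[int, str]:
    """Map each 1-based character position to its character, last position first.

    Assumption: "reverse" means the positions are listed in descending order.
    """
    return {pos: token[pos - 1] for pos in range(len(token), 0, -1)}


def build_examples(vocab: List[str]) -> List[Dict[str, str]]:
    """Turn every non-empty vocabulary entry into a (prompt, target) pair.

    The prompt is the token itself; the target is the JSON-encoded
    position-to-character mapping (hypothetical format).
    """
    examples = []
    for token in vocab:
        if not token:
            continue  # skip empty entries; there is nothing to predict
        target = json.dumps(reverse_char_positions(token), ensure_ascii=False)
        examples.append({"prompt": token, "target": target})
    return examples


# Toy vocabulary standing in for a real tokenizer's vocabulary (assumption).
toy_vocab = ["小说", "token", "ab"]
examples = build_examples(toy_vocab)

for example in examples:
    print(example["prompt"], "->", example["target"])
```

In practice the vocabulary would come from the tokenizer used by the model being trained, and byte-level entries would need to be decoded to text before being split into characters; both steps are omitted here.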

Abstract (original)

Tokenization techniques such as Byte-Pair Encoding (BPE) and Byte-Level BPE (BBPE) have significantly improved the computational efficiency and vocabulary representation stability of large language models (LLMs) by segmenting text into tokens. However, this segmentation often obscures the internal character structures and sequences within tokens, preventing models from fully learning these intricate details during training. Consequently, LLMs struggle to comprehend the character compositions and positional relationships within tokens, especially when fine-tuned on downstream tasks with limited data. In this paper, we introduce Token Internal Position Awareness (TIPA), a novel approach that enhances LLMs’ understanding of internal token structures by training them on reverse character prediction tasks using the tokenizer’s own vocabulary. This method enables models to effectively learn and generalize character positions and internal structures. Experimental results demonstrate that LLMs trained with TIPA outperform baseline models in predicting character positions at the token level. Furthermore, when applied to the downstream task of Chinese Spelling Correction (CSC), TIPA not only accelerates model convergence but also significantly improves task performance.

arXiv information

Authors	Zhu Xu, Zhiqiang Zhao, Zihan Zhang, Yuchi Liu, Quanwei Shen, Fei Liu, Yu Kuang
Published	2024-11-26 18:44:39+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
