Problematic Tokens: Tokenizer Bias in Large Language Models

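Summary

Recent advances in large language models (LLMs) such as GPT-4 and GPT-4o have delivered excellent performance, particularly in resource-rich languages such as English, thanks to extensive datasets that ensure robust training.
Conversely, these models show limitations when processing under-resourced languages such as Chinese and Korean, where problems such as hallucinated responses remain widespread.
This paper traces the cause of these disparities to the tokenization process inherent in these models.
Specifically, it investigates how the tokenizer's vocabulary, which is often used to speed up tokenization and reduce token counts but is built independently of the model's actual training data, fails to adequately represent non-English languages.
This misrepresentation leads to a proliferation of under-trained or untrained tokens, which perpetuates bias and raises serious concerns about data security and ethical standards.
We aim to analyze GPT-4o's tokenization mechanism in detail, show how its simplified token handling amplifies these risks, and offer strategic solutions to mitigate the associated security and ethical issues.
Through this study, we emphasize the critical need to rethink tokenization frameworks in order to foster fairer and more secure AI technologies.
The code and data are available at https://github.com/yeyimilk/LLMGPT4o.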

Summary (original)

Recent advancements in large language models (LLMs), such as GPT-4 and GPT-4o, have shown exceptional performance, especially in languages with abundant resources like English, thanks to extensive datasets that ensure robust training. Conversely, these models exhibit limitations when processing under-resourced languages such as Chinese and Korean, where issues including hallucinatory responses remain prevalent. This paper traces the roots of these disparities to the tokenization process inherent to these models. Specifically, it explores how the tokenizer's vocabulary, often used to speed up the tokenization process and reduce tokens but constructed independently of the actual model training data, inadequately represents non-English languages. This misrepresentation results in the propagation of under-trained or untrained tokens, which perpetuate biases and pose serious concerns related to data security and ethical standards. We aim to dissect the tokenization mechanics of GPT-4o, illustrating how its simplified token-handling methods amplify these risks, and to offer strategic solutions to mitigate associated security and ethical issues. Through this study, we emphasize the critical need to rethink tokenization frameworks to foster more equitable and secure AI technologies. The code and data are available at https://github.com/yeyimilk/LLMGPT4o
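
The mechanism described in the abstract, a vocabulary built separately from the training data leaving some tokens unseen during training, can be illustrated with a small sketch. This is a toy illustration only, not the authors' method and not GPT-4o's actual tokenizer: the greedy longest-match tokenizer, the names greedy_tokenize and untrained_tokens, and the vocabulary and corpus below are all hypothetical and chosen for brevity.

```python
from collections import Counter


def greedy_tokenize(text, vocab):
    """Toy longest-match-first tokenizer (an assumption for illustration).

    Falls back to a single character when no vocabulary entry matches.
    """
    max_len = max(len(tok) for tok in vocab)
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, shrinking down to one character.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in vocab:
                tokens.append(piece)
                i += size
                break
    return tokens


def untrained_tokens(vocab, corpus):
    """Return vocabulary entries that never occur when the corpus is tokenized."""
    counts = Counter()
    for doc in corpus:
        counts.update(greedy_tokenize(doc, vocab))
    return sorted(tok for tok in vocab if counts[tok] == 0)


# Hypothetical vocabulary, fixed before looking at the training corpus.
TOY_VOCAB = {"the", "token", "izer", " ", "模型", "训练数据"}
# Hypothetical training corpus, mostly English.
TOY_CORPUS = ["the tokenizer", "the token", "模型"]

UNSEEN = untrained_tokens(TOY_VOCAB, TOY_CORPUS)
print(UNSEEN)
```

In this toy setup, a vocabulary entry that the corpus never produces would receive no training signal, which is the kind of under-trained or untrained token the abstract refers to.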

arXiv information

Authors	Jin Yang,Zhiqiang Wang,Yanbin Lin,Zunduo Zhao
Published	2024-11-14 03:53:56+00:00
arXiv page	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
