From Smør-re-brød to Subwords: Training LLMs on Danish, One Morpheme at a Time

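Summary

The best-performing transformer-based language models use subword tokenization methods such as Byte-Pair Encoding (BPE).
However, these approaches often overlook linguistic principles such as morphological segmentation, which the authors consider fundamental to understanding language-specific word structure.
In this study, an annotated Danish morphological dataset is used to train a semi-supervised model for morphological segmentation, which makes it possible to develop tokenizers optimized for Danish morphology.
Four different tokenizers, including two custom morphological tokenizers, are evaluated by analyzing how well they segment Danish words into morphemes.
In addition, two generative transformer models, CerebrasGPT-111M and LLaMA-3.2 1B, are trained with these tokenizers and evaluated on downstream performance.
The findings show that the custom tokenizers substantially improve morphological segmentation, reaching an F1 score of 58.84 compared with 39.28 for a Danish BPE tokenizer.
On downstream tasks, models trained with the morphological tokenizers outperform models using BPE tokenizers across different evaluation metrics.
These results indicate that incorporating Danish morphological segmentation strategies into tokenizers improves the performance of generative transformer models on Danish.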

Abstract (original)

The best-performing transformer-based language models use subword tokenization techniques, such as Byte-Pair Encoding (BPE). However, these approaches often overlook linguistic principles, such as morphological segmentation, which we believe is fundamental for understanding language-specific word structure. In this study, we leverage an annotated Danish morphological dataset to train a semi-supervised model for morphological segmentation, enabling the development of tokenizers optimized for Danish morphology. We evaluate four distinct tokenizers, including two custom morphological tokenizers, by analyzing their performance in morphologically segmenting Danish words. Additionally, we train two generative transformer models, CerebrasGPT-111M and LLaMA-3.2 1B, using these tokenizers and evaluate their downstream performance. Our findings reveal that our custom-developed tokenizers substantially enhance morphological segmentation, achieving an F1 score of 58.84, compared to 39.28 achieved by a Danish BPE tokenizer. In downstream tasks, models trained with our morphological tokenizers outperform those using BPE tokenizers across different evaluation metrics. These results highlight that incorporating Danish morphological segmentation strategies into tokenizers leads to improved performance in generative transformer models on the Danish language.
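
The abstract reports F1 scores for morphological segmentation but does not say how the score is computed. As a rough illustration only, the sketch below uses one common choice, boundary-level F1, where a tokenizer's split points inside a word are compared with the gold morpheme boundaries. The function names `boundaries` and `boundary_f1` are hypothetical, the predicted split in the usage example is invented, and the paper's actual metric may differ.

```python
def boundaries(segments):
    """Return the set of internal split positions (character offsets) of a segmented word."""
    positions = set()
    offset = 0
    for segment in segments[:-1]:  # the end of the last segment is not an internal boundary
        offset += len(segment)
        positions.add(offset)
    return positions


def boundary_f1(gold, predicted):
    """Boundary-level F1 between a gold and a predicted segmentation of the same word.

    Assumption: this is one common definition, not necessarily the one used in the paper.
    """
    if "".join(gold) != "".join(predicted):
        raise ValueError("gold and predicted segmentations must spell the same word")
    gold_b, pred_b = boundaries(gold), boundaries(predicted)
    if not gold_b and not pred_b:
        return 1.0  # both leave the word unsplit, which counts as full agreement
    true_positives = len(gold_b & pred_b)
    precision = true_positives / len(pred_b) if pred_b else 0.0
    recall = true_positives / len(gold_b) if gold_b else 0.0
    if precision + recall == 0.0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Gold split taken from the paper's title; the predicted split is a made-up example.
gold = ["smør", "re", "brød"]
predicted = ["smørre", "brød"]
score = boundary_f1(gold, predicted)
```

Here the predicted split finds one of the two gold boundaries and proposes no wrong ones, so precision is 1.0, recall is 0.5, and the F1 is 2/3.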

arXiv information

Authors	Mikkel Wildner Kildeberg, Emil Allerslev Schledermann, Nicolaj Larsen, Rob van der Goot
Published	2025-04-02 09:26:02+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
