Tokenization and Morphology in Multilingual Language Models: A Comparative Analysis of mT5 and ByT5

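Summary

Morphology is a key factor in multilingual language modeling because it poses direct challenges for tokenization.
This work seeks to understand how tokenization affects the morphological knowledge encoded in multilingual language models.
Specifically, it captures the effect of tokenization by contrasting two multilingual language models, mT5 and ByT5.
The two models share the same architecture, training objective, and training data, and differ only in their tokenization strategy (subword tokenization vs. character-level tokenization).
Probing the morphological knowledge encoded in these models on four tasks and 17 languages shows that multilingual language models learn the morphological systems of some languages better than others despite similar average performance, and that morphological information is encoded in the middle and late layers.
Character-based models need a few more layers to reach commensurate probing accuracy.
Finally, the authors show that languages with more irregularities benefit more from having a higher share of the pre-training data.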

Abstract (original)

Morphology is a crucial factor for multilingual language modeling as it poses direct challenges for tokenization. Here, we seek to understand how tokenization influences the morphological knowledge encoded in multilingual language models. Specifically, we capture the impact of tokenization by contrasting two multilingual language models: mT5 and ByT5. The two models share the same architecture, training objective, and training data and only differ in their tokenization strategies: subword tokenization vs. character-level tokenization. Probing the morphological knowledge encoded in these models on four tasks and 17 languages, our analyses show that multilingual language models learn the morphological systems of some languages better than others despite similar average performance and that morphological information is encoded in the middle and late layers, where character-based models need a few more layers to yield commensurate probing accuracy. Finally, we show that languages with more irregularities benefit more from having a higher share of the pre-training data.
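To make the contrast between the two tokenization strategies concrete, the following is a minimal sketch in Python. It does not reproduce the actual mT5 or ByT5 tokenizers. The byte-level function simply splits text into UTF-8 bytes, which is the granularity ByT5 operates at, and the subword function is a toy greedy longest-match segmenter over a hypothetical hand-written vocabulary (`TOY_VOCAB`), standing in for a learned subword vocabulary. The Turkish example word is chosen purely for illustration and is not taken from the paper.

```python
def byte_tokens(text: str) -> list[int]:
    """Byte-level segmentation: one token per UTF-8 byte."""
    return list(text.encode("utf-8"))


def greedy_subword_tokens(word: str, vocab: set[str]) -> list[str]:
    """Toy greedy longest-match segmentation (an illustrative assumption,
    not the algorithm used by mT5). Falls back to single characters when
    no vocabulary entry matches at the current position."""
    pieces = []
    i = 0
    while i < len(word):
        # Try the longest candidate first, shrinking down to one character.
        for j in range(len(word), i, -1):
            if word[i:j] in vocab or j == i + 1:
                pieces.append(word[i:j])
                i = j
                break
    return pieces


# Hypothetical vocabulary that happens to align with morpheme boundaries.
TOY_VOCAB = {"ev", "ler", "imiz", "den"}

word = "evlerimizden"  # Turkish: ev-ler-imiz-den, "from our houses"
subword_pieces = greedy_subword_tokens(word, TOY_VOCAB)
byte_pieces = byte_tokens(word)

print(subword_pieces)
print(len(byte_pieces), "byte tokens")
```

With this toy vocabulary the subword segmentation yields four pieces that coincide with morphemes, whereas the byte-level segmentation yields one token per byte and leaves it to the model to compose morphemes from them. A real subword vocabulary offers no guarantee that piece boundaries align with morpheme boundaries, which is one reason the choice of tokenizer may affect how morphological information is encoded.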

arXiv information

Authors	Thao Anh Dang, Limor Raviv, Lukas Galke
Published	2024-10-15 14:14:19+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
