Comparative analysis of subword tokenization approaches for Indian languages

Summary

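Tokenization is the act of breaking text down into smaller parts, or tokens, that are easier for machines to process.
It is a key stage in machine translation (MT) models.
Subword tokenization strengthens this process by breaking words down into smaller subword units, which is especially beneficial for languages with complex morphology or a very large vocabulary.
It helps capture the intricate structure of words in Indian languages (ILs), such as prefixes, suffixes, and other morphological variations.
These languages frequently use agglutinative structures, in which a word is formed by combining multiple morphemes such as suffixes, prefixes, and stems.
A suitable tokenization strategy must therefore be chosen to handle these cases.
This paper examines how different subword tokenization techniques, namely SentencePiece, Byte Pair Encoding (BPE), and WordPiece tokenization, affect ILs.
The effectiveness of these techniques is investigated in statistical, neural, and multilingual neural machine translation models.
All models are evaluated with standard metrics: the Bilingual Evaluation Understudy (BLEU) score, TER, METEOR, CHRF, RIBES, and COMET.
The results indicate that, for the majority of language pairs in the statistical and neural MT models, the SentencePiece tokenizer consistently performed better than the other tokenizers in terms of BLEU score.
In the multilingual neural machine translation model, however, BPE tokenization outperformed the other tokenization techniques.
The results also show that, even with the same tokenizer and dataset used for each model, translation from ILs into English scored higher than translation from English into ILs.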

Summary (original)

Tokenization is the act of breaking down text into smaller parts, or tokens, that are easier for machines to process. This is a key phase in machine translation (MT) models. Subword tokenization enhances this process by breaking down words into smaller subword units, which is especially beneficial in languages with complicated morphology or a vast vocabulary. It is useful in capturing the intricate structure of words in Indian languages (ILs), such as prefixes, suffixes, and other morphological variations. These languages frequently use agglutinative structures, in which words are formed by the combination of multiple morphemes such as suffixes, prefixes, and stems. As a result, a suitable tokenization strategy must be chosen to address these scenarios. This paper examines how different subword tokenization techniques, such as SentencePiece, Byte Pair Encoding (BPE), and WordPiece Tokenization, affect ILs. The effectiveness of these subword tokenization techniques is investigated in statistical, neural, and multilingual neural machine translation models. All models are examined using standard evaluation metrics, such as the Bilingual Evaluation Understudy (BLEU) score, TER, METEOR, CHRF, RIBES, and COMET. Based on the results, it appears that for the majority of language pairs for the Statistical and Neural MT models, the SentencePiece tokenizer continuously performed better than other tokenizers in terms of BLEU score. However, BPE tokenization outperformed other tokenization techniques in the context of Multilingual Neural Machine Translation model. The results show that, despite using the same tokenizer and dataset for each model, translations from ILs to English surpassed translations from English to ILs.
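To make the idea of breaking words into subword units concrete, here is a minimal sketch of how BPE merge rules can be learned and applied. It is a toy illustration under stated assumptions, not the implementation evaluated in the paper: the function names (`learn_bpe`, `merge_pair`, `segment`), the `</w>` end-of-word marker, and the tiny English corpus are all hypothetical choices made for this example.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with one merged symbol."""
    out = []
    i = 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def learn_bpe(words, num_merges):
    """Learn up to `num_merges` merge rules from a list of words (toy sketch)."""
    # Each word starts as a sequence of characters plus an end-of-word marker.
    vocab = Counter(tuple(w) + ("</w>",) for w in words)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by how often each word occurs.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        # Pick the most frequent pair; ties go to the pair seen first.
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Apply the new merge to every word in the vocabulary.
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            new_vocab[merge_pair(symbols, best)] += freq
        vocab = new_vocab
    return merges


def segment(word, merges):
    """Split a word into subword units by replaying the learned merges in order."""
    symbols = tuple(word) + ("</w>",)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


if __name__ == "__main__":
    # Hypothetical toy corpus, chosen only to show a shared suffix being learned.
    corpus = ["low"] * 5 + ["lower"] * 2 + ["newest"] * 6 + ["widest"] * 3
    rules = learn_bpe(corpus, 3)
    print(rules)
    print(segment("widest", rules))
```

On this toy corpus the first merges build up the shared suffix `est`, which hints at why frequency-driven merging can recover recurring affixes in morphologically rich languages. Production tokenizers add details this sketch omits, such as byte-level fallback, normalization, and vocabulary size limits.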

arXiv information

Authors	Sudhansu Bala Das, Samujjal Choudhury, Tapas Kumar Mishra, Bidyut Kr. Patra
Published	2025-05-22 16:24:37+00:00

Source and services used

arxiv.jp, Google
