Registering Source Tokens to Target Language Spaces in Multilingual Neural Machine Translation

Summary

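Multilingual neural machine translation (MNMT) enables translation between arbitrary pairs of languages by training a model with a limited number of parameters on parallel data only.
However, such MNMT models still lag behind large language models (LLMs) in performance, which limits their practicality.
This work addresses that limitation by introducing registering, which yields a new state of the art among decoder-only MNMT models.
Specifically, a set of artificial tokens that specify the target language, called registers, is inserted into the input sequence between the source tokens and the target tokens.
With a modified attention mask, target token generation attends only to the activations of the registers, which represent the source tokens in the target language space.
Experiments on EC-40, a large-scale benchmark, show that the method outperforms related methods that rely on optimizing multilingual representations.
Scaling up, the authors collect 9.3 billion sentence pairs across 24 languages from public datasets and pre-train two models, named MITRE (multilingual translation with registers).
One of them, MITRE-913M, outperforms NLLB-3.3B, performs comparably to commercial LLMs, and adapts well in fine-tuning.
Finally, the models are open-sourced to facilitate further research and development in MNMT (https://github.com/zhiqu22/mitre).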
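
The attention-mask idea can be illustrated with a small sketch. This is not the authors' implementation: the function name `register_attention_mask`, the sequence layout `[source | registers | target]`, the choice of causal attention everywhere else, and letting target tokens attend to earlier target tokens are all assumptions made for illustration. The only constraint taken from the abstract is that target positions do not attend to source positions directly and see the source only through the registers.

```python
def register_attention_mask(n_src, n_reg, n_tgt):
    """Build a boolean mask where mask[q][k] is True if query position q
    may attend to key position k.

    Assumed layout (hypothetical): [source tokens | registers | target tokens].
    """
    total = n_src + n_reg + n_tgt
    tgt_start = n_src + n_reg  # index of the first target position
    mask = []
    for q in range(total):
        row = []
        for k in range(total):
            # Assumption: ordinary causal attention as the starting point.
            allowed = k <= q
            # From the abstract: target queries must not see source keys;
            # they reach the source only via the register positions.
            if q >= tgt_start and k < n_src:
                allowed = False
            row.append(allowed)
        mask.append(row)
    return mask


# Print a small example: 3 source tokens, 2 registers, 2 target tokens.
mask = register_attention_mask(3, 2, 2)
for row in mask:
    print("".join("1" if allowed else "." for allowed in row))
```

In the printed grid, each row is a query position and each column a key position. The register rows still see the source columns, while the target rows have the source columns blanked out, so any source information reaching the target must pass through the registers.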

Abstract (original)

The multilingual neural machine translation (MNMT) enables arbitrary translations across multiple languages by training a model with limited parameters using parallel data only. However, the performance of such MNMT models still lags behind that of large language models (LLMs), limiting their practicality. In this work, we address this limitation by introducing registering to achieve the new state-of-the-art of decoder-only MNMT models. Specifically, we insert a set of artificial tokens specifying the target language, called registers, into the input sequence between the source and target tokens. By modifying the attention mask, the target token generation only pays attention to the activation of registers, representing the source tokens in the target language space. Experiments on EC-40, a large-scale benchmark, show that our method outperforms related methods driven by optimizing multilingual representations. We further scale up and collect 9.3 billion sentence pairs across 24 languages from public datasets to pre-train two models, namely MITRE (multilingual translation with registers). One of them, MITRE-913M, outperforms NLLB-3.3B, achieves comparable performance with commercial LLMs, and shows strong adaptability in fine-tuning. Finally, we open-source our models to facilitate further research and development in MNMT: https://github.com/zhiqu22/mitre.

arXiv information

Authors	Zhi Qu, Yiran Wang, Jiannan Mao, Chenchen Ding, Hideki Tanaka, Masao Utiyama, Taro Watanabe
Published	2025-01-06 12:42:54+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
