EAG: Extract and Generate Multi-way Aligned Corpus for Complete Multi-lingual Neural Machine Translation

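Summary

Complete Multi-lingual Neural Machine Translation (C-MNMT) outperforms conventional MNMT by constructing a multi-way aligned corpus, that is, by aligning bilingual training examples from different language pairs when either their source or target sides are identical.
However, exactly identical sentences across different language pairs are rare, so the benefit of a multi-way aligned corpus is limited by its scale.
To address this problem, this paper proposes "Extract and Generate" (EAG), a two-step approach for constructing a large-scale, high-quality multi-way aligned corpus from bilingual data.
Specifically, it first extracts candidate aligned examples by pairing bilingual examples from different language pairs whose source or target sentences are highly similar.
It then generates the final aligned examples from these candidates with a well-trained generation model.
With this two-step pipeline, EAG can construct a large-scale multi-way aligned corpus whose diversity is almost identical to that of the original bilingual corpus.
Experiments on two publicly available datasets, WMT-5 and OPUS-100, show that the proposed method achieves significant improvements over strong baselines, with gains of +1.1 and +1.4 BLEU points on the two datasets, respectively.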
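
The extraction step can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the abstract does not say which similarity measure or threshold EAG uses, so this sketch swaps in a token-level ratio from Python's difflib, a hypothetical threshold of 0.7, and English as the shared side of both corpora. The names `similarity` and `extract_candidates` are hypothetical. The generation step is not sketched, since it depends on a trained model.

```python
from difflib import SequenceMatcher
from typing import List, Tuple

# (English sentence, sentence in the other language); English as the
# shared side is an assumption made for this sketch.
Pair = Tuple[str, str]


def similarity(a: str, b: str) -> float:
    """Token-level similarity in [0, 1].

    A stand-in metric: the abstract does not name the measure EAG uses.
    """
    return SequenceMatcher(None, a.lower().split(), b.lower().split()).ratio()


def extract_candidates(
    en_x: List[Pair], en_y: List[Pair], threshold: float = 0.7
) -> List[Tuple[Pair, Pair, float]]:
    """Pair examples from two corpora whose English sides are highly similar.

    With threshold=1.0 only sentences identical up to case and whitespace
    are paired, which corresponds to exact-match alignment.
    """
    candidates = []
    for ex in en_x:
        # Pick the most similar English sentence from the other corpus.
        best = max(en_y, key=lambda ey: similarity(ex[0], ey[0]), default=None)
        if best is None:
            continue
        score = similarity(ex[0], best[0])
        if score >= threshold:
            candidates.append((ex, best, score))
    return candidates


# Toy data, invented for this example.
en_de = [
    ("The cat sat on the mat .", "Die Katze saß auf der Matte ."),
    ("I like tea .", "Ich mag Tee ."),
]
en_fr = [
    ("The cat sat on a mat .", "Le chat s'est assis sur un tapis ."),
    ("Where is the station ?", "Où est la gare ?"),
]

for de_pair, fr_pair, score in extract_candidates(en_de, en_fr):
    print(f"{score:.2f}", de_pair, fr_pair)
```

On this toy data the two "cat" sentences differ by one token and are paired as a candidate, while exact matching would find nothing. This is the scale limitation the summary describes. In the full method, a generation model would then turn each candidate into a final aligned example.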

Summary (original)

Complete Multi-lingual Neural Machine Translation (C-MNMT) achieves superior performance against the conventional MNMT by constructing multi-way aligned corpus, i.e., aligning bilingual training examples from different language pairs when either their source or target sides are identical. However, since exactly identical sentences from different language pairs are scarce, the power of the multi-way aligned corpus is limited by its scale. To handle this problem, this paper proposes ‘Extract and Generate’ (EAG), a two-step approach to construct large-scale and high-quality multi-way aligned corpus from bilingual data. Specifically, we first extract candidate aligned examples by pairing the bilingual examples from different language pairs with highly similar source or target sentences; and then generate the final aligned examples from the candidates with a well-trained generation model. With this two-step pipeline, EAG can construct a large-scale and multi-way aligned corpus whose diversity is almost identical to the original bilingual corpus. Experiments on two publicly available datasets i.e., WMT-5 and OPUS-100, show that the proposed method achieves significant improvements over strong baselines, with +1.1 and +1.4 BLEU points improvements on the two datasets respectively.

arXiv information

Authors	Yulin Xu, Zhen Yang, Fandong Meng, Jie Zhou
Published	2024-07-22 09:22:23+00:00
arXiv page	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google
