Semi-supervised Neural Machine Translation with Consistency Regularization for Low-Resource Languages

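Summary

Title – Semi-supervised Neural Machine Translation with Consistency Regularization for Low-Resource Languages

Summary –
– With the rise of deep learning, the field of machine translation has advanced considerably.
– However, most studies require large amounts of parallel data, which is in short supply, especially for low-resource languages.
– This paper presents a simple yet effective method for low-resource languages: it augments high-quality sentence pairs and trains NMT models in a semi-supervised manner.
– Specifically, it combines a cross-entropy loss for supervised learning with a KL divergence term for unsupervised learning, computed from pseudo target sentences and augmented target sentences derived from the model.
– It also introduces a SentenceBERT-based filter that raises the quality of the augmented data by retaining only semantically similar sentence pairs.
– Experiments show that the method significantly improves on NMT baselines, by 0.46–2.03 BLEU, especially on low-resource datasets, and that unsupervised training on the augmented data is more efficient than reusing the ground-truth target sentences for supervised learning.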
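
As a rough illustration of the combined objective described in the summary, the sketch below adds a cross-entropy term on labeled data to a KL divergence term between two sets of predicted token distributions. The abstract does not give the exact formulation, so the function names, the direction of the KL term, the averaging over positions, and the weight `lam` are all assumptions made for illustration, not the authors' implementation.

```python
import math


def cross_entropy(probs, target_ids):
    """Mean negative log-likelihood of the reference tokens.

    probs: one probability distribution (list of floats) per target position.
    target_ids: the reference token id at each position.
    """
    nll = [-math.log(p[t]) for p, t in zip(probs, target_ids)]
    return sum(nll) / len(nll)


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over the same vocabulary."""
    return sum(
        pi * math.log((pi + eps) / (qi + eps))
        for pi, qi in zip(p, q)
        if pi > 0
    )


def consistency_loss(probs_pseudo, probs_augmented):
    """Mean per-position KL between the two sets of predictions (assumed form)."""
    kls = [kl_divergence(p, q) for p, q in zip(probs_pseudo, probs_augmented)]
    return sum(kls) / len(kls)


def semi_supervised_loss(probs_labeled, target_ids,
                         probs_pseudo, probs_augmented, lam=1.0):
    """Supervised cross-entropy plus a weighted consistency term.

    lam is a hypothetical weighting hyperparameter, not taken from the paper.
    """
    supervised = cross_entropy(probs_labeled, target_ids)
    unsupervised = consistency_loss(probs_pseudo, probs_augmented)
    return supervised + lam * unsupervised
```

When the two sets of predictions agree exactly, the consistency term is zero and the objective reduces to the supervised cross-entropy; any disagreement adds a positive penalty.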

Abstract (original)

The advent of deep learning has led to a significant gain in machine translation. However, most of the studies required a large parallel dataset which is scarce and expensive to construct and even unavailable for some languages. This paper presents a simple yet effective method to tackle this problem for low-resource languages by augmenting high-quality sentence pairs and training NMT models in a semi-supervised manner. Specifically, our approach combines the cross-entropy loss for supervised learning with KL Divergence for unsupervised fashion given pseudo and augmented target sentences derived from the model. We also introduce a SentenceBERT-based filter to enhance the quality of augmenting data by retaining semantically similar sentence pairs. Experimental results show that our approach significantly improves NMT baselines, especially on low-resource datasets with 0.46–2.03 BLEU scores. We also demonstrate that using unsupervised training for augmented data is more efficient than reusing the ground-truth target sentences for supervised learning.
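
The filtering step mentioned in the abstract could look roughly like the following sketch, which keeps a pair only when the cosine similarity of its two sentence embeddings reaches a threshold. The abstract does not say which sentences are compared or what cutoff is used, so `embed` (a stand-in for a SentenceBERT encoder), `filter_augmented`, and the default `threshold` are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def filter_augmented(pairs, embed, threshold=0.8):
    """Keep (original, augmented) pairs whose embeddings are similar enough.

    embed: any callable mapping a sentence to a vector; in practice this would
    be a sentence encoder, here it is left abstract (assumption).
    threshold: hypothetical similarity cutoff, not taken from the paper.
    """
    return [
        (original, augmented)
        for original, augmented in pairs
        if cosine_similarity(embed(original), embed(augmented)) >= threshold
    ]
```

Passing the encoder in as a callable keeps the sketch independent of any particular embedding library; a toy lookup table is enough to exercise it.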

arXiv information

Authors	Viet H. Pham, Thang M. Pham, Giang Nguyen, Long Nguyen, Dien Dinh
Published	2023-04-02 15:24:08+00:00
arXiv page	arxiv_id(pdf)

Source, services used

arxiv.jp, OpenAI
