DeBERTaV3: Improving DeBERTa using ELECTRA-Style Pre-Training with Gradient-Disentangled Embedding Sharing

Summary

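This paper introduces DeBERTaV3, a new pre-trained language model. It improves on the original DeBERTa model by replacing masked language modeling (MLM) with replaced token detection (RTD), a more sample-efficient pre-training task.
Our analysis shows that the standard embedding sharing used in ELECTRA hurts both training efficiency and model performance.
The reason is that the training losses of the discriminator and the generator pull the token embeddings in different directions, creating a "tug-of-war" dynamic.
We therefore propose a new gradient-disentangled embedding sharing method that avoids this tug-of-war and improves both training efficiency and the quality of the pre-trained model.
We pre-trained DeBERTaV3 with the same settings as DeBERTa and demonstrated its outstanding performance on a wide range of downstream natural language understanding (NLU) tasks.
Taking the eight-task GLUE benchmark as an example, the DeBERTaV3 Large model achieves an average score of 91.37%, which is 1.37% higher than DeBERTa and 1.91% higher than ELECTRA, setting a new state of the art (SOTA) among models with a similar structure.
In addition, we pre-trained a multilingual model, mDeBERTa, and observed larger improvements over strong baselines than with the English models.
For example, mDeBERTa Base achieves 79.8% zero-shot cross-lingual accuracy on XNLI, a 3.6% improvement over XLM-R Base, setting a new SOTA on this benchmark.
The pre-trained models and inference code are publicly available at https://github.com/microsoft/DeBERTa.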
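
To make the RTD objective concrete, the following minimal Python sketch builds one training example: some tokens are swapped for sampled ones, and each position gets a binary label saying whether it was replaced. The function name `make_rtd_example`, the uniform sampler standing in for the generator, and the 15% replacement rate are illustrative assumptions, not details taken from the paper or its code.

```python
import random


def make_rtd_example(tokens, vocab, replace_prob=0.15, seed=0):
    """Corrupt a token sequence and return (corrupted, labels) for RTD.

    labels[i] is 1 if position i holds a replaced token, else 0.
    """
    rng = random.Random(seed)
    corrupted, labels = [], []
    for tok in tokens:
        if rng.random() < replace_prob:
            # Hypothetical stand-in for the generator: sample uniformly.
            # A real generator would be a small masked language model.
            sampled = rng.choice(vocab)
            corrupted.append(sampled)
            # A sample that happens to equal the original token is
            # labelled as "original", not "replaced".
            labels.append(int(sampled != tok))
        else:
            corrupted.append(tok)
            labels.append(0)
    return corrupted, labels


tokens = ["the", "chef", "cooked", "the", "meal"]
vocab = ["the", "chef", "cooked", "ate", "meal", "dog"]
corrupted, labels = make_rtd_example(tokens, vocab, replace_prob=0.5)
```

Every position carries a label here, so the discriminator receives a training signal from all tokens rather than only from the masked ones, which is the usual explanation for why RTD is described as more sample-efficient than MLM.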

Summary (original)

This paper presents a new pre-trained language model, DeBERTaV3, which improves the original DeBERTa model by replacing mask language modeling (MLM) with replaced token detection (RTD), a more sample-efficient pre-training task. Our analysis shows that vanilla embedding sharing in ELECTRA hurts training efficiency and model performance. This is because the training losses of the discriminator and the generator pull token embeddings in different directions, creating the ‘tug-of-war’ dynamics. We thus propose a new gradient-disentangled embedding sharing method that avoids the tug-of-war dynamics, improving both training efficiency and the quality of the pre-trained model. We have pre-trained DeBERTaV3 using the same settings as DeBERTa to demonstrate its exceptional performance on a wide range of downstream natural language understanding (NLU) tasks. Taking the GLUE benchmark with eight tasks as an example, the DeBERTaV3 Large model achieves a 91.37% average score, which is 1.37% over DeBERTa and 1.91% over ELECTRA, setting a new state-of-the-art (SOTA) among the models with a similar structure. Furthermore, we have pre-trained a multi-lingual model mDeBERTa and observed a larger improvement over strong baselines compared to English models. For example, the mDeBERTa Base achieves a 79.8% zero-shot cross-lingual accuracy on XNLI and a 3.6% improvement over XLM-R Base, creating a new SOTA on this benchmark. We have made our pre-trained models and inference code publicly available at https://github.com/microsoft/DeBERTa.
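
The tug-of-war and its fix can be illustrated with a toy one-dimensional embedding. The sketch below assumes that gradient-disentangled sharing builds the discriminator's embedding as the generator's embedding, treated as a constant for the discriminator loss, plus a zero-initialised residual. The abstract does not spell out this construction, and the quadratic losses, the target values, and the names `train` and `e_delta` are hypothetical choices made only for illustration.

```python
def train(mode, steps=200, lr=0.1, g_target=1.0, d_target=-1.0):
    """Train a toy scalar embedding under 'shared' or 'gdes' gradient routing.

    Returns (generator_embedding, discriminator_embedding).
    """
    if mode not in ("shared", "gdes"):
        raise ValueError(f"unknown mode: {mode!r}")
    e_g = 0.5      # generator embedding
    e_delta = 0.0  # discriminator-only residual, zero-initialised
    for _ in range(steps):
        e_d = e_g + e_delta                 # embedding the discriminator sees
        grad_gen = 2.0 * (e_g - g_target)   # gradient of (e_g - g_target)^2
        grad_disc = 2.0 * (e_d - d_target)  # gradient of (e_d - d_target)^2
        if mode == "shared":
            # Vanilla sharing: both losses update the same embedding,
            # so they pull it toward two different targets.
            e_g -= lr * (grad_gen + grad_disc)
        else:
            # Gradient-disentangled: the discriminator loss is stopped
            # from reaching e_g and only updates the residual.
            e_g -= lr * grad_gen
            e_delta -= lr * grad_disc
    return e_g, e_g + e_delta


shared_g, shared_d = train("shared")
gdes_g, gdes_d = train("gdes")
```

In this toy setting the shared embedding settles midway between the two targets and satisfies neither loss, while under the disentangled routing the generator embedding reaches its own target and the residual absorbs the difference the discriminator needs.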

arXiv information

Authors	Pengcheng He, Jianfeng Gao, Weizhu Chen
Published	2023-03-24 09:17:17+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
