Predictor-Corrector Enhanced Transformers with Exponential Moving Average Coefficient Learning

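Summary

Residual networks, viewed as discrete approximations of ordinary differential equations (ODEs), have inspired significant advances in neural network design, including multistep methods, high-order methods, and multi-particle dynamical systems.
The precision of the ODE solution significantly affects parameter optimization and, in turn, model performance.
This work presents a series of explorations of Transformer architecture design aimed at minimizing the error relative to the true "solution." First, it introduces a predictor-corrector learning framework to minimize truncation error, consisting of a high-order predictor and a multistep corrector.
Second, it proposes an exponential moving average-based coefficient learning method to strengthen the high-order predictor.
Extensive experiments on large-scale machine translation, abstractive summarization, language modeling, and natural language understanding benchmarks demonstrate the superiority of the approach.
On the WMT’14 English-German and English-French tasks, the model achieved BLEU scores of 30.95 and 44.27, respectively.
On the OPUS multilingual machine translation task, the model surpasses a robust 3.8B DeepNet by an average of 2.9 SacreBLEU while using only 1/3 of the parameters.
It also beats LLama models by 5.7 accuracy points on the LM Harness Evaluation.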

Abstract (original)

Residual networks, as discrete approximations of Ordinary Differential Equations (ODEs), have inspired significant advancements in neural network design, including multistep methods, high-order methods, and multi-particle dynamical systems. The precision of the solution to ODEs significantly affects parameter optimization, thereby impacting model performance. In this work, we present a series of advanced explorations of Transformer architecture design to minimize the error compared to the true “solution.” First, we introduce a predictor-corrector learning framework to minimize truncation errors, which consists of a high-order predictor and a multistep corrector. Second, we propose an exponential moving average-based coefficient learning method to strengthen our higher-order predictor. Extensive experiments on large-scale machine translation, abstractive summarization, language modeling, and natural language understanding benchmarks demonstrate the superiority of our approach. On the WMT’14 English-German and English-French tasks, our model achieved BLEU scores of 30.95 and 44.27, respectively. Furthermore, on the OPUS multilingual machine translation task, our model surpasses a robust 3.8B DeepNet by an average of 2.9 SacreBLEU, using only 1/3 parameters. Notably, it also beats LLama models by 5.7 accuracy points on the LM Harness Evaluation.
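
As a rough illustration of the predictor-corrector idea mentioned in the abstract, the sketch below applies it to a scalar test ODE rather than to a Transformer. It is not the authors' method: the paper describes a high-order predictor and a multistep corrector, whereas this example assumes the simplest textbook pairing (an explicit Euler predictor followed by a trapezoidal corrector, i.e. Heun's method). The test equation dy/dt = -y, the step size, and all function names are assumptions chosen for illustration.

```python
import math


def f(t, y):
    # Test ODE dy/dt = -y, whose exact solution is y(t) = y0 * exp(-t).
    return -y


def euler(y0, h, steps):
    # Plain explicit Euler: y_{n+1} = y_n + h * f(t_n, y_n).
    # A residual connection y + F(y) has this same form with h = 1.
    t, y = 0.0, y0
    for _ in range(steps):
        y = y + h * f(t, y)
        t += h
    return y


def predictor_corrector(y0, h, steps):
    # Heun's method: predict with Euler, then correct with the trapezoidal rule.
    t, y = 0.0, y0
    for _ in range(steps):
        y_pred = y + h * f(t, y)                        # predictor step
        y = y + 0.5 * h * (f(t, y) + f(t + h, y_pred))  # corrector step
        t += h
    return y


# Integrate from t = 0 to t = 1 and compare against the exact value exp(-1).
exact = math.exp(-1.0)
err_euler = abs(euler(1.0, 0.1, 10) - exact)
err_pc = abs(predictor_corrector(1.0, 0.1, 10) - exact)
print(f"Euler error:               {err_euler:.6f}")
print(f"Predictor-corrector error: {err_pc:.6f}")
```

With the same step size, the corrected estimate lands noticeably closer to the exact solution than plain Euler, which is the kind of truncation-error reduction the abstract refers to.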

arXiv information

Authors	Bei Li, Tong Zheng, Rui Wang, Jiahao Liu, Qingyan Guo, Junliang Guo, Xu Tan, Tong Xiao, Jingbo Zhu, Jingang Wang, Xunliang Cai
Published	2024-11-05 12:26:25+00:00
arXiv page	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
