Improving BERT with Hybrid Pooling Network and Drop Mask

Summary

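Transformer-based pre-trained language models such as BERT have achieved great success on a wide range of natural language understanding tasks.
Prior work has found that BERT captures a rich hierarchy of linguistic information across its layers.
However, vanilla BERT uses the same self-attention mechanism in every layer to model these different contextual features.
This paper proposes HybridBERT, a model that combines self-attention and pooling networks to encode different contextual features in each layer.
It also proposes a simple DropMask method to address the mismatch between pre-training and fine-tuning caused by excessive use of the special mask token during masked language modeling pre-training.
In experiments, HybridBERT outperforms BERT in pre-training, with lower loss, faster training (8% relative), and lower memory cost (13% relative), and in transfer learning, with 1.5% relative higher accuracy on downstream tasks.
In addition, DropMask improves BERT's accuracy on downstream tasks across a range of masking rates.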

Summary (original)

Transformer-based pre-trained language models, such as BERT, achieve great success in various natural language understanding tasks. Prior research found that BERT captures a rich hierarchy of linguistic information at different layers. However, the vanilla BERT uses the same self-attention mechanism for each layer to model the different contextual features. In this paper, we propose a HybridBERT model which combines self-attention and pooling networks to encode different contextual features in each layer. Additionally, we propose a simple DropMask method to address the mismatch between pre-training and fine-tuning caused by excessive use of special mask tokens during Masked Language Modeling pre-training. Experiments show that HybridBERT outperforms BERT in pre-training with lower loss, faster training speed (8% relative), lower memory cost (13% relative), and also in transfer learning with 1.5% relative higher accuracies on downstream tasks. Additionally, DropMask improves accuracies of BERT on downstream tasks across various masking rates.
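
The mismatch mentioned in the abstract arises because the special mask token appears in pre-training inputs but never in fine-tuning inputs. The sketch below is not the paper's DropMask method, whose details are not given here; it only illustrates standard BERT-style masked language modeling corruption (select a fraction of positions, then replace 80% with the mask token, 10% with a random token, and leave 10% unchanged) so the effect of the masking rate is concrete. The names `mask_tokens` and `mask_fraction`, the `"[MASK]"` string, and the toy vocabulary are assumptions made for illustration.

```python
import random

MASK = "[MASK]"  # assumed spelling of the special mask token


def mask_tokens(tokens, masking_rate=0.15, vocab=None, seed=0):
    """Corrupt a token list the way standard BERT MLM does (illustrative only).

    Returns the corrupted tokens and a dict {position: original token}
    holding the prediction targets.
    """
    rng = random.Random(seed)
    vocab = vocab or sorted(set(tokens))  # toy vocabulary: tokens of the input
    n_select = max(1, round(len(tokens) * masking_rate))
    positions = sorted(rng.sample(range(len(tokens)), n_select))

    corrupted = list(tokens)
    for pos in positions:
        r = rng.random()
        if r < 0.8:
            corrupted[pos] = MASK               # 80%: replace with the mask token
        elif r < 0.9:
            corrupted[pos] = rng.choice(vocab)  # 10%: replace with a random token
        # remaining 10%: keep the original token

    labels = {pos: tokens[pos] for pos in positions}
    return corrupted, labels


def mask_fraction(tokens):
    """Fraction of positions occupied by the mask token."""
    return sum(t == MASK for t in tokens) / len(tokens)


if __name__ == "__main__":
    sentence = "the quick brown fox jumps over the lazy dog again and again today".split()
    for rate in (0.15, 0.4):
        corrupted, labels = mask_tokens(sentence, masking_rate=rate)
        print(rate, len(labels), round(mask_fraction(corrupted), 2))
    # Fine-tuning inputs contain no mask tokens at all.
    print("fine-tuning input:", mask_fraction(sentence))
```

Raising `masking_rate` increases the share of mask tokens the model sees during pre-training, while a fine-tuning input always has a mask fraction of zero; that gap is the mismatch the abstract refers to.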

arXiv information

Authors	Qian Chen, Wen Wang, Qinglin Zhang, Chong Deng, Ma Yukun, Siqi Zheng
Published	2023-07-14 10:20:08+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
