Do not Mask Randomly: Effective Domain-adaptive Pre-training by Masking In-domain Keywords

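Abstract

We propose a new task-agnostic in-domain pre-training method that sits between generic pre-training and fine-tuning.
Our approach selectively masks in-domain keywords, that is, words that compactly represent the target domain.
We identify such keywords using KeyBERT (Grootendorst, 2020).
We evaluate the approach in six settings: three datasets combined with two different pre-trained language models (PLMs).
Our results show that fine-tuned PLMs adapted with our in-domain pre-training strategy outperform both PLMs that used in-domain pre-training with random masking and PLMs that followed the common pre-train-then-fine-tune paradigm.
Furthermore, the overhead of identifying in-domain keywords is reasonable, for example 7-15% of the pre-training time (for two epochs) for BERT Large (Devlin et al., 2019).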
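
As a rough illustration of the keyword-masking idea summarized here, the sketch below replaces whole-word occurrences of given keywords with a mask token. It is not the authors' implementation: the function name `mask_keywords`, the word-level regex splitting, and the example keyword list are all assumptions made for illustration. In the paper the keywords come from KeyBERT and masking would operate on the PLM's own subword tokens; here the keyword list is simply passed in by hand.

```python
import re

MASK_TOKEN = "[MASK]"  # BERT-style mask token; other PLMs may use a different string


def mask_keywords(text, keywords, mask_token=MASK_TOKEN):
    """Replace each whole-word, case-insensitive keyword match with mask_token.

    Hypothetical helper: `keywords` is assumed to have been extracted
    beforehand (for example with KeyBERT); this function only does the masking.
    """
    lowered = {k.lower() for k in keywords}
    # Split into alternating word and non-word runs so that spacing and
    # punctuation are preserved when the pieces are joined back together.
    pieces = re.findall(r"\w+|\W+", text)
    masked = [mask_token if p.lower() in lowered else p for p in pieces]
    return "".join(masked)


# Hand-picked keywords standing in for a keyword extractor's output.
domain_keywords = ["patient", "surgery", "knee"]
sentence = "The patient reported knee pain after surgery."
print(mask_keywords(sentence, domain_keywords))
```

Random masking would instead pick positions uniformly at random regardless of how informative the word is for the domain; the contrast between the two strategies is the point of the comparison reported in the abstract.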

Abstract (original)

We propose a novel task-agnostic in-domain pre-training method that sits between generic pre-training and fine-tuning. Our approach selectively masks in-domain keywords, i.e., words that provide a compact representation of the target domain. We identify such keywords using KeyBERT (Grootendorst, 2020). We evaluate our approach using six different settings: three datasets combined with two distinct pre-trained language models (PLMs). Our results reveal that the fine-tuned PLMs adapted using our in-domain pre-training strategy outperform PLMs that used in-domain pre-training with random masking as well as those that followed the common pre-train-then-fine-tune paradigm. Further, the overhead of identifying in-domain keywords is reasonable, e.g., 7-15% of the pre-training time (for two epochs) for BERT Large (Devlin et al., 2019).

arXiv information

Authors	Shahriar Golchin, Mihai Surdeanu, Nazgol Tavabi, Ata Kiapour
Published	2023-07-14 05:09:04+00:00
arXiv page	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
