Do Transformers Parse while Predicting the Masked Word?

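Abstract

Pre-trained language models have been shown to encode linguistic structure.
While being trained with unsupervised loss functions such as masked language modeling, they embed dependency and constituency parse trees in their representations.
Doubts have been raised about whether the models actually parse or only perform some computation weakly correlated with parsing.
We study two questions: (a) Is it possible to explicitly describe transformers with realistic embedding dimension, number of heads, etc. that are capable of parsing, or even approximate parsing?
(b) Why do pre-trained models capture parse structure?
This paper takes a step toward answering these questions in the context of generative modeling with PCFGs.
We show that masked language models like BERT or RoBERTa of moderate size can approximately execute the Inside-Outside algorithm for the English PCFG [Marcus et al, 1993].
We also show that the Inside-Outside algorithm is optimal for the masked language modeling loss on PCFG-generated data.
We further give a construction of transformers with $50$ layers, $15$ attention heads, and $1275$-dimensional embeddings on average whose embeddings can be used to do constituency parsing with $>70\%$ F1 score on the PTB dataset.
We conduct probing experiments on models pre-trained on PCFG-generated data and show that this allows recovery not only of an approximate parse tree but also of the marginal span probabilities computed by the Inside-Outside algorithm, which suggests an implicit bias of masked language modeling toward this algorithm.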
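As a rough illustration of the quantities mentioned above, the following is a minimal sketch of the Inside-Outside algorithm on a tiny hand-written PCFG in Chomsky normal form. The grammar, the function names (`inside`, `outside`, `span_marginals`, `masked_word_posterior`), and the example sentence are all assumptions made for illustration; they are not the PTB grammar or the transformer construction from the paper. The masked-word posterior is computed here by brute-force enumeration over a small vocabulary using only the inside pass, which should give the same Bayes-optimal prediction under the grammar but is not how the paper realizes it.

```python
from collections import defaultdict

# Hypothetical toy PCFG in Chomsky normal form (NOT the PTB grammar).
BINARY = {  # (parent, left child, right child) -> rule probability
    ("S", "NP", "VP"): 1.0,
    ("VP", "V", "NP"): 1.0,
    ("NP", "Det", "N"): 0.7,
}
LEXICAL = {  # (preterminal, word) -> rule probability
    ("NP", "she"): 0.3,
    ("Det", "the"): 1.0,
    ("N", "dog"): 0.5,
    ("N", "cat"): 0.5,
    ("V", "saw"): 1.0,
}
ROOT = "S"


def inside(words):
    """beta[(i, j)][A] = P(A derives words[i..j]); spans are inclusive."""
    n = len(words)
    beta = defaultdict(lambda: defaultdict(float))
    for i, w in enumerate(words):
        for (a, word), p in LEXICAL.items():
            if word == w:
                beta[(i, i)][a] += p
    for length in range(2, n + 1):  # build longer spans from shorter ones
        for i in range(n - length + 1):
            j = i + length - 1
            for k in range(i, j):  # split point: left = i..k, right = k+1..j
                for (a, b, c), p in BINARY.items():
                    beta[(i, j)][a] += p * beta[(i, k)][b] * beta[(k + 1, j)][c]
    return beta


def outside(words, beta):
    """alpha[(i, j)][A] = P(generating everything outside i..j, with A over i..j)."""
    n = len(words)
    alpha = defaultdict(lambda: defaultdict(float))
    alpha[(0, n - 1)][ROOT] = 1.0
    for length in range(n, 1, -1):  # push mass from parents down to children
        for i in range(n - length + 1):
            j = i + length - 1
            for k in range(i, j):
                for (a, b, c), p in BINARY.items():
                    parent = alpha[(i, j)][a] * p
                    alpha[(i, k)][b] += parent * beta[(k + 1, j)][c]
                    alpha[(k + 1, j)][c] += parent * beta[(i, k)][b]
    return alpha


def span_marginals(words):
    """Posterior probability that words[i..j] is a constituent, for every span.

    Assumes the sentence has nonzero probability under the grammar.
    """
    n = len(words)
    beta = inside(words)
    alpha = outside(words, beta)
    z = beta[(0, n - 1)][ROOT]  # total probability of the sentence
    return {
        (i, j): sum(alpha[(i, j)][a] * beta[(i, j)][a]
                    for a in list(beta[(i, j)])) / z
        for i in range(n) for j in range(i, n)
    }


def masked_word_posterior(words, t, vocab):
    """P(word at position t | all other words), by enumerating the vocabulary."""
    scores = {}
    for w in vocab:
        filled = words[:t] + [w] + words[t + 1:]
        scores[w] = inside(filled)[(0, len(words) - 1)][ROOT]
    total = sum(scores.values())
    return {w: s / total for w, s in scores.items()}


sentence = "she saw the dog".split()
marginals = span_marginals(sentence)
posterior = masked_word_posterior(sentence, 3, ["she", "saw", "the", "dog", "cat"])
```

In this toy grammar the sentence has a single parse, so the spans in that parse get marginal probability 1 and all other spans get 0, while masking the last word splits the posterior evenly between "dog" and "cat". With an ambiguous grammar the marginals would fall strictly between 0 and 1, which is the kind of quantity the probing experiments try to recover from model embeddings.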

Abstract (original)

Pre-trained language models have been shown to encode linguistic structures, e.g. dependency and constituency parse trees, in their embeddings while being trained on unsupervised loss functions like masked language modeling. Some doubts have been raised whether the models actually are doing parsing or only some computation weakly correlated with it. We study questions: (a) Is it possible to explicitly describe transformers with realistic embedding dimension, number of heads, etc. that are capable of doing parsing — or even approximate parsing? (b) Why do pre-trained models capture parsing structure? This paper takes a step toward answering these questions in the context of generative modeling with PCFGs. We show that masked language models like BERT or RoBERTa of moderate sizes can approximately execute the Inside-Outside algorithm for the English PCFG [Marcus et al, 1993]. We also show that the Inside-Outside algorithm is optimal for masked language modeling loss on the PCFG-generated data. We also give a construction of transformers with $50$ layers, $15$ attention heads, and $1275$ dimensional embeddings in average such that using its embeddings it is possible to do constituency parsing with $>70\%$ F1 score on PTB dataset. We conduct probing experiments on models pre-trained on PCFG-generated data to show that this not only allows recovery of approximate parse tree, but also recovers marginal span probabilities computed by the Inside-Outside algorithm, which suggests an implicit bias of masked language modeling towards this algorithm.

arXiv information

Authors	Haoyu Zhao, Abhishek Panigrahi, Rong Ge, Sanjeev Arora
Published	2023-03-14 17:49:50+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
