Analysing The Impact of Sequence Composition on Language Model Pre-Training

Summary

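Most language model pre-training frameworks concatenate multiple documents into fixed-length sequences and use causal masking to compute the likelihood of each token given its context.
This strategy is widely adopted because it is simple and efficient.
However, the effect of the pre-training sequence composition strategy on the generalisation properties of the model has so far not been studied in depth.
This work finds that causal masking can bring distracting information from previous documents into the context during pre-training, which hurts model performance on language modelling and downstream tasks.
With intra-document causal masking, the likelihood of each token is conditioned only on the preceding tokens of the same document, which removes potentially distracting information from previous documents and significantly improves performance.
In addition, concatenating related documents reduces some of the potential distraction during pre-training, and the proposed efficient retrieval-based sequence construction method, BM25Chunk, improves the in-context learning (+11.6%), knowledge memorisation (+9.8%), and context utilisation (+7.2%) abilities of language models without sacrificing efficiency.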

Summary (original)

Most language model pre-training frameworks concatenate multiple documents into fixed-length sequences and use causal masking to compute the likelihood of each token given its context; this strategy is widely adopted due to its simplicity and efficiency. However, to this day, the influence of the pre-training sequence composition strategy on the generalisation properties of the model remains under-explored. In this work, we find that applying causal masking can lead to the inclusion of distracting information from previous documents during pre-training, which negatively impacts the performance of the models on language modelling and downstream tasks. In intra-document causal masking, the likelihood of each token is only conditioned on the previous tokens in the same document, eliminating potential distracting information from previous documents and significantly improving performance. Furthermore, we find that concatenating related documents can reduce some potential distractions during pre-training, and our proposed efficient retrieval-based sequence construction method, BM25Chunk, can improve in-context learning (+11.6%), knowledge memorisation (+9.8%), and context utilisation (+7.2%) abilities of language models without sacrificing efficiency.
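
The difference between plain causal masking and intra-document causal masking can be illustrated with a small sketch. It is not the authors' implementation: the function names and the `doc_lengths` parameter are assumptions made for illustration, and the mask is built as a plain boolean matrix rather than in the form a real training framework would use.

```python
from typing import List


def causal_mask(total: int) -> List[List[bool]]:
    """Plain causal mask: position i may attend to every position j <= i."""
    return [[j <= i for j in range(total)] for i in range(total)]


def intra_document_causal_mask(doc_lengths: List[int]) -> List[List[bool]]:
    """Causal mask restricted to document boundaries.

    doc_lengths (hypothetical parameter) lists the token count of each
    document packed into one training sequence, in order. The returned
    mask[i][j] is True only when j <= i and both positions belong to
    the same document.
    """
    # doc_ids[k] is the index of the document that position k belongs to.
    doc_ids = [d for d, n in enumerate(doc_lengths) for _ in range(n)]
    total = len(doc_ids)
    return [
        [j <= i and doc_ids[i] == doc_ids[j] for j in range(total)]
        for i in range(total)
    ]


if __name__ == "__main__":
    # Two documents of 2 and 3 tokens packed into one 5-token sequence.
    for row in intra_document_causal_mask([2, 3]):
        print("".join("x" if allowed else "." for allowed in row))
```

Under plain causal masking the third token of this packed sequence would attend to both tokens of the first document; with the intra-document mask it attends only to itself, because it is the first token of the second document.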

arXiv information

Authors	Yu Zhao, Yuanbin Qu, Konrad Staniszewski, Szymon Tworkowski, Wei Liu, Piotr Miłoś, Yuxiang Wu, Pasquale Minervini
Published	2024-02-21 18:23:16+00:00
arXiv page	arxiv_id (pdf)

Source and services used

arxiv.jp, Google
