Unlocking Tokens as Data Points for Generalization Bounds on Larger Language Models

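Summary

Large language models (LLMs) with billions of parameters excel at predicting the next token in a sequence.
Recent work computes non-vacuous compression-based generalization bounds for LLMs, but these bounds become vacuous for large models at the billion-parameter scale.
Moreover, these bounds are obtained through restrictive compression techniques, so they bound compressed models that generate low-quality text.
In addition, the tightness of these existing bounds depends on the number of IID documents in the training set rather than on the much larger number of non-IID constituent tokens, leaving the potential for tighter bounds untapped.
In this work, the authors instead use properties of martingales to derive generalization bounds that benefit from the vast number of tokens in LLM training sets.
Because a dataset contains far more tokens than documents, these generalization bounds not only tolerate but actually benefit from far less restrictive compression schemes.
Using Monarch matrices, Kronecker factorizations, and post-training quantization, they achieve non-vacuous generalization bounds for LLMs as large as LLaMA2-70B.
Unlike previous approaches, this work achieves the first non-vacuous bounds for models that are deployed in practice and generate high-quality text.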

Summary (original)

Large language models (LLMs) with billions of parameters excel at predicting the next token in a sequence. Recent work computes non-vacuous compression-based generalization bounds for LLMs, but these bounds are vacuous for large models at the billion-parameter scale. Moreover, these bounds are obtained through restrictive compression techniques, bounding compressed models that generate low-quality text. Additionally, the tightness of these existing bounds depends on the number of IID documents in a training set rather than the much larger number of non-IID constituent tokens, leaving untapped potential for tighter bounds. In this work, we instead use properties of martingales to derive generalization bounds that benefit from the vast number of tokens in LLM training sets. Since a dataset contains far more tokens than documents, our generalization bounds not only tolerate but actually benefit from far less restrictive compression schemes. With Monarch matrices, Kronecker factorizations, and post-training quantization, we achieve non-vacuous generalization bounds for LLMs as large as LLaMA2-70B. Unlike previous approaches, our work achieves the first non-vacuous bounds for models that are deployed in practice and generate high-quality text.
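
As a rough illustration of why counting tokens rather than documents can tighten a bound, the sketch below evaluates the deviation term of the standard Azuma-Hoeffding inequality for bounded martingale differences, which shrinks as 1/sqrt(n). This is not the bound derived in the paper, whose exact form is not given in the abstract and which would also depend on the compressed model size. The function name, the bound `c` on each difference, the failure probability `delta`, and both dataset sizes are assumptions chosen only for illustration.

```python
import math


def azuma_hoeffding_slack(n, c=1.0, delta=0.05):
    """Return eps such that the mean of n martingale differences, each
    bounded in [-c, c], exceeds eps with probability at most delta.

    Obtained by solving exp(-n * eps**2 / (2 * c**2)) = delta for eps.
    """
    if n <= 0:
        raise ValueError("n must be positive")
    return c * math.sqrt(2.0 * math.log(1.0 / delta) / n)


# Hypothetical dataset sizes, not taken from the paper.
num_documents = 10_000
num_tokens = 10_000_000

slack_docs = azuma_hoeffding_slack(num_documents)
slack_tokens = azuma_hoeffding_slack(num_tokens)

print(f"slack with n = documents: {slack_docs:.4f}")
print(f"slack with n = tokens:    {slack_tokens:.4f}")
```

With 1000 times more data points, the deviation term is smaller by a factor of sqrt(1000), which leaves more room in the bound for a less aggressively compressed model.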

arXiv information

Authors	Sanae Lotfi, Yilun Kuang, Brandon Amos, Micah Goldblum, Marc Finzi, Andrew Gordon Wilson
Published	2024-07-25 16:13:58+00:00
arXiv page	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
