Implicit Optimization Bias of Next-Token Prediction in Linear Models

Summary

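We begin an investigation of the optimization properties of next-token prediction (NTP), the dominant training paradigm for modern language models.
Specifically, we study the structural properties of the solutions that gradient-based optimizers select from among the many possible minimizers of the NTP objective.
By framing NTP as cross-entropy minimization over distinct contexts, each associated with a sparse conditional probability distribution over a finite vocabulary of tokens, we introduce "NTP-separability conditions" that make it possible to reach the data-entropy lower bound.
In this setting, focusing on linear models with fixed context embeddings, we characterize the optimization bias of gradient descent (GD). Within the data subspace defined by the sparsity patterns of the distinct contexts, GD selects parameters that equate the logit differences of in-support tokens to their log-odds.
In the orthogonal subspace, the GD parameters diverge in norm and select the direction that maximizes a margin specific to NTP.
These findings extend earlier work on implicit bias in one-hot classification to the NTP setting, highlight key differences, and motivate further study of the optimization and generalization properties of NTP, regardless of the specific architecture used to generate the context embeddings.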

Summary (original)

We initiate an investigation into the optimization properties of next-token prediction (NTP), the dominant training paradigm for modern language models. Specifically, we study the structural properties of the solutions selected by gradient-based optimizers among the many possible minimizers of the NTP objective. By framing NTP as cross-entropy minimization across distinct contexts, each tied with a sparse conditional probability distribution across a finite vocabulary of tokens, we introduce ‘NTP-separability conditions’ that enable reaching the data-entropy lower bound. With this setup, and focusing on linear models with fixed context embeddings, we characterize the optimization bias of gradient descent (GD): Within the data subspace defined by the sparsity patterns of distinct contexts, GD selects parameters that equate the logits’ differences of in-support tokens to their log-odds. In the orthogonal subspace, the GD parameters diverge in norm and select the direction that maximizes a margin specific to NTP. These findings extend previous research on implicit bias in one-hot classification to the NTP setting, highlighting key differences and prompting further research into the optimization and generalization properties of NTP, irrespective of the specific architecture used to generate the context embeddings.
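The relationship between the entropy lower bound, the in-support log-odds, and the diverging component can be illustrated with a small numerical sketch. This is not the paper's experiment: the toy frequencies `pi`, the sparse distributions `p`, and the helper names are all hypothetical, and the sketch works directly on logits, assuming they can be realized by some linear model on the context embeddings. It sets the in-support logit differences equal to the log-odds and pushes out-of-support logits down by a scale `R`, then checks that the cross-entropy approaches the empirical conditional entropy from above as `R` grows.

```python
import math

# Hypothetical toy data: context j occurs with frequency pi[j] and has a
# sparse next-token distribution p[j] over a vocabulary of 4 tokens.
pi = [0.5, 0.3, 0.2]
p = [
    [0.7, 0.3, 0.0, 0.0],
    [0.0, 0.5, 0.5, 0.0],
    [0.0, 0.0, 0.0, 1.0],
]


def entropy_lower_bound(pi, p):
    """Empirical conditional entropy: sum_j pi[j] * H(p[j])."""
    return -sum(
        w * sum(q * math.log(q) for q in row if q > 0)
        for w, row in zip(pi, p)
    )


def ntp_loss(logits, pi, p):
    """Cross-entropy of softmax(logits[j]) against p[j], weighted by pi[j]."""
    total = 0.0
    for w, row, z in zip(pi, p, logits):
        m = max(z)  # subtract the max for a numerically stable log-sum-exp
        log_norm = m + math.log(sum(math.exp(v - m) for v in z))
        total -= w * sum(q * (v - log_norm) for q, v in zip(row, z) if q > 0)
    return total


def logits_at_scale(p, R):
    """In-support logits are log p (so their differences equal the log-odds);
    out-of-support logits are pushed down to -R."""
    return [[math.log(q) if q > 0 else -R for q in row] for row in p]


H = entropy_lower_bound(pi, p)
for R in (0.0, 1.0, 5.0, 30.0):
    gap = ntp_loss(logits_at_scale(p, R), pi, p) - H
    print(f"R = {R:5.1f}   loss - entropy = {gap:.3e}")
```

In this toy setup the gap is positive for every finite `R` and shrinks only as `R` grows, which mirrors the statement that the lower bound is approached while part of the parameters diverges in norm. Which direction gradient descent actually selects for the diverging part is the subject of the paper and is not modeled here.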

arXiv information

Author	Christos Thrampoulidis
Published	2024-10-31 17:01:45+00:00
arXiv site	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google
