Do pretrained Transformers Really Learn In-context by Gradient Descent?

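Summary

Is in-context learning (ICL) implicitly equivalent to gradient descent (GD)?
Several recent studies draw analogies between the dynamics of GD and the emergent behavior of ICL in large language models.
However, these studies make assumptions that are far from the realistic natural-language setting in which language models are trained.
This gap between theory and practice calls for further investigation to validate how well those analogies apply.
We begin by highlighting weaknesses in prior work that constructs Transformer weights to simulate gradient descent.
Examples of the mismatch with real-world settings include experiments that train Transformers on an ICL objective, inconsistencies between the order sensitivity of ICL and that of GD, the sparsity of the constructed weights, and their sensitivity to parameter changes.
We then probe and compare the ICL vs. GD hypothesis in a natural setting.
We run comprehensive empirical analyses on language models pretrained on natural data (LLaMa-7B).
Comparisons across various performance metrics show that ICL and GD behave inconsistently as a function of factors such as the dataset, the model, and the number of demonstrations.
We observe that ICL and GD adapt the output distribution of language models differently.
These results indicate that the equivalence between ICL and GD is an open hypothesis that requires nuanced consideration and further study.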
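
As a rough illustration of what comparing two adapted output distributions can look like, the sketch below measures the KL divergence and the top-k token overlap between two next-token distributions. Everything in it is an assumption made for illustration: the logit vectors are made-up toy values over a five-token vocabulary, the helper names (`softmax`, `kl_divergence`, `top_k_overlap`) are hypothetical, and these two measures are generic choices rather than the metrics used in the paper.

```python
import math


def softmax(logits):
    """Convert a list of logits into a probability distribution."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two distributions over the same vocabulary."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def top_k_overlap(p, q, k):
    """Fraction of the k most likely tokens that p and q have in common."""
    top_p = set(sorted(range(len(p)), key=lambda i: p[i], reverse=True)[:k])
    top_q = set(sorted(range(len(q)), key=lambda i: q[i], reverse=True)[:k])
    return len(top_p & top_q) / k


# Hypothetical logits over a 5-token vocabulary (toy values, not real model output).
icl_logits = [1.0, 2.5, 0.5, 0.0, -1.0]  # stand-in for a model prompted with demonstrations
gd_logits = [2.5, 1.0, 0.2, 0.8, -1.0]   # stand-in for a model fine-tuned on the same demonstrations

p_icl = softmax(icl_logits)
p_gd = softmax(gd_logits)

print("KL(ICL || GD):", kl_divergence(p_icl, p_gd))
print("top-1 overlap:", top_k_overlap(p_icl, p_gd, 1))
print("top-2 overlap:", top_k_overlap(p_icl, p_gd, 2))
```

In this toy case the two distributions share the same top-2 tokens but disagree on the most likely one, which is the kind of difference a single accuracy number can hide.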

Abstract (original)

Is In-Context Learning (ICL) implicitly equivalent to Gradient Descent (GD)? Several recent works draw analogies between the dynamics of GD and the emergent behavior of ICL in large language models. However, these works make assumptions far from the realistic natural language setting in which language models are trained. Such discrepancies between theory and practice, therefore, necessitate further investigation to validate their applicability. We start by highlighting the weaknesses in prior works that construct Transformer weights to simulate gradient descent. Their experiments with training Transformers on ICL objective, inconsistencies in the order sensitivity of ICL and GD, sparsity of the constructed weights, and sensitivity to parameter changes are some examples of a mismatch from the real-world setting. Furthermore, we probe and compare the ICL vs. GD hypothesis in a natural setting. We conduct comprehensive empirical analyses on language models pretrained on natural data (LLaMa-7B). Our comparisons on various performance metrics highlight the inconsistent behavior of ICL and GD as a function of various factors such as datasets, models, and number of demonstrations. We observe that ICL and GD adapt the output distribution of language models differently. These results indicate that the equivalence between ICL and GD is an open hypothesis, requires nuanced considerations and calls for further studies.

arXiv information

Authors	Lingfeng Shen, Aayush Mishra, Daniel Khashabi
Published	2023-10-12 17:32:09+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
