Revisiting the Hypothesis: Do pretrained Transformers Learn In-Context by Gradient Descent?

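Summary

The emergence of In-Context Learning (ICL) in LLMs remains a significant but poorly understood phenomenon.
To explain ICL, recent studies have tried to connect it theoretically to Gradient Descent (GD).
Does this connection hold in actual pre-trained models?
We highlight limiting assumptions in prior work that make its setting considerably different from the practical setting in which language models are trained.
For example, the hand-constructed theoretical weights used in these studies have properties that do not match those of real LLMs.
Furthermore, their experimental verification uses an ICL objective (training models explicitly for ICL), which differs from the ICL that emerges in the wild.
We also look for evidence in real models.
We find that ICL and GD differ in their sensitivity to the order in which demonstrations are observed.
Finally, we probe the ICL vs. GD hypothesis in a natural setting.
We conduct a comprehensive empirical analysis of a language model pre-trained on natural data (LLaMa-7B).
Comparisons across three performance metrics highlight inconsistent behavior between ICL and GD as a function of factors such as the dataset, the model, and the number of demonstrations.
We observe that ICL and GD modify the output distribution of language models differently.
These results indicate that the equivalence between ICL and GD remains an open hypothesis that calls for further study.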

Summary (original)

The emergence of In-Context Learning (ICL) in LLMs remains a significant phenomenon with little understanding. To explain ICL, recent studies try to theoretically connect it to Gradient Descent (GD). We ask, does this connection hold up in actual pre-trained models? We highlight the limiting assumptions in prior works that make their context considerably different from the practical context in which language models are trained. For example, the theoretical hand-constructed weights used in these studies have properties that don’t match those of real LLMs. Furthermore, their experimental verification uses an ICL objective (training models explicitly for ICL), which differs from the emergent ICL in the wild. We also look for evidence in real models. We observe that ICL and GD have different sensitivity to the order in which they observe demonstrations. Finally, we probe and compare the ICL vs. GD hypothesis in a natural setting. We conduct comprehensive empirical analyses on language models pre-trained on natural data (LLaMa-7B). Our comparisons of three performance metrics highlight the inconsistent behavior of ICL and GD as a function of various factors such as datasets, models, and the number of demonstrations. We observe that ICL and GD modify the output distribution of language models differently. These results indicate that the equivalence between ICL and GD remains an open hypothesis and call for further studies.

arXiv information

Authors	Lingfeng Shen, Aayush Mishra, Daniel Khashabi
Published	2024-02-29 18:47:18+00:00
arXiv page	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google
