The Hyperfitting Phenomenon: Sharpening and Stabilizing LLMs for Open-Ended Text Generation

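Summary

This paper presents a counter-intuitive generalization result: overfitting pre-trained large language models (LLMs) on very small datasets.
In open-ended text generation, it is well documented that LLMs tend to produce repetitive, dull sequences, and this is especially apparent under greedy decoding.
The problem persists even in state-of-the-art LLMs with billions of parameters that were trained by next-token prediction on large datasets.
Further fine-tuning these models until they reach near-zero training loss on a small set of samples, a process the authors call hyperfitting, greatly improves their long-sequence generation.
Greedy decoding with these hyperfitted models even outperforms Top-P sampling over long sequences, in terms of both diversity and human preference.
The phenomenon extends to LLMs of various sizes, to different domains, and even to autoregressive image generation.
It is also found to be distinctly different from Grokking and double descent.
Surprisingly, the experiments indicate that hyperfitted models rarely fall into repeating the sequences they were trained on, and that output quality remains high even when those sequences are explicitly blocked.
All hyperfitted models produce extremely low-entropy predictions, often assigning nearly all of the probability to a single token.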

Summary (original)

This paper introduces the counter-intuitive generalization results of overfitting pre-trained large language models (LLMs) on very small datasets. In the setting of open-ended text generation, it is well-documented that LLMs tend to generate repetitive and dull sequences, a phenomenon that is especially apparent when generating using greedy decoding. This issue persists even with state-of-the-art LLMs containing billions of parameters, trained via next-token prediction on large datasets. We find that by further fine-tuning these models to achieve a near-zero training loss on a small set of samples (a process we refer to as hyperfitting), the long-sequence generative capabilities are greatly enhanced. Greedy decoding with these hyperfitted models even outperforms Top-P sampling over long sequences, both in terms of diversity and human preferences. This phenomenon extends to LLMs of various sizes, different domains, and even autoregressive image generation. We further find this phenomenon to be distinctly different from that of Grokking and double descent. Surprisingly, our experiments indicate that hyperfitted models rarely fall into repeating sequences they were trained on, and even explicitly blocking these sequences results in high-quality output. All hyperfitted models produce extremely low-entropy predictions, often allocating nearly all probability to a single token.
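
As a rough illustration of the last point, the sketch below computes the entropy of a next-token distribution and compares greedy decoding with the candidate set used by Top-P sampling. The two distributions, the function names, and the threshold `p=0.9` are assumptions made up for this example; they are not taken from the paper. It only shows a general property: when nearly all probability sits on one token, the Top-P candidate set shrinks to that token, so sampling and greedy decoding pick the same thing.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of a discrete probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def greedy_token(probs):
    """Greedy decoding: return the index of the most probable token."""
    return max(range(len(probs)), key=lambda i: probs[i])


def top_p_nucleus(probs, p=0.9):
    """Return the smallest set of token indices, taken in order of
    decreasing probability, whose cumulative mass reaches p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    nucleus, mass = [], 0.0
    for i in order:
        nucleus.append(i)
        mass += probs[i]
        if mass >= p:
            break
    return nucleus


# Hypothetical next-token distributions over a 5-token vocabulary.
flat = [0.35, 0.25, 0.20, 0.12, 0.08]   # probability spread over many tokens
sharp = [0.96, 0.01, 0.01, 0.01, 0.01]  # nearly all mass on one token

for name, dist in (("flat", flat), ("sharp", sharp)):
    print(name,
          "entropy:", round(entropy(dist), 3),
          "greedy:", greedy_token(dist),
          "top-p candidates:", top_p_nucleus(dist))
```

With the sharp distribution, Top-P sampling has only one candidate left, which is one way to read why low-entropy predictions make the choice of decoding strategy matter less.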

arXiv information

Authors	Fredrik Carlsson, Fangyu Liu, Daniel Ward, Murathan Kurfali, Joakim Nivre
Published	2025-02-26 17:51:31+00:00
arXiv site	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google
