Ignore the KL Penalty! Boosting Exploration on Critical Tokens to Enhance RL Fine-Tuning

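Summary

The ability to achieve long-term goals is a key challenge in the current development of large language models (LLMs).
To address it, a pre-trained LLM can be fine-tuned with reinforcement learning (RL) so that it explores solutions that optimize a given goal.
Exploration with LLMs is difficult, however: the model has to discover new solutions while staying close enough to the pre-trained model that its basic capabilities do not degrade.
This trade-off is typically controlled with a Kullback-Leibler (KL) penalty.
The paper investigates the exploration dynamics of a small language model on a simple arithmetic task.
It shows how varying degrees of pre-training influence exploration, and demonstrates the importance of "critical tokens", which have a dramatic impact on the final outcome.
Based on this, the authors introduce a simple modification to the KL penalty that favors exploration on critical tokens and makes the RL fine-tuning stage more efficient.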

Abstract (original)

The ability to achieve long-term goals is a key challenge in the current development of large language models (LLMs). To address this, pre-trained LLMs can be fine-tuned with reinforcement learning (RL) to explore solutions that optimize a given goal. However, exploration with LLMs is difficult, as a balance has to be struck between discovering new solutions and staying close enough to the pre-trained model, so as not to degrade basic capabilities. This is typically controlled with a Kullback-Leibler (KL) penalty. In this paper, we investigate the exploration dynamics of a small language model on a simple arithmetic task. We show how varying degrees of pre-training influence exploration and demonstrate the importance of ‘critical tokens’ which have a dramatic impact on the final outcome. Consequently, we introduce a simple modification to the KL penalty that favors exploration on critical tokens, increasing the efficiency of the RL fine-tuning stage.
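
The abstract does not spell out how the KL penalty is modified, so the following is only a minimal sketch of the general idea, not the authors' formula. It assumes, purely for illustration, that a token counts as "critical" when the pre-trained (reference) model is uncertain about it, and that the per-token penalty is scaled by a weight derived from the reference distribution's normalized entropy. The function names, the weighting rule, and the `beta` parameter are all hypothetical.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions given as lists of probabilities.

    Assumes q[i] > 0 wherever p[i] > 0.
    """
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def normalized_entropy(p):
    """Shannon entropy of p divided by log(len(p)); lies in [0, 1] for len(p) > 1."""
    h = -sum(pi * math.log(pi) for pi in p if pi > 0)
    return h / math.log(len(p))


def weighted_kl_penalty(policy, reference, beta=1.0):
    """Hypothetical per-token penalty (an assumption, not the paper's definition).

    The usual KL term is scaled by (1 - normalized entropy of the reference) ** beta,
    so the penalty shrinks where the reference model is uncertain and stays close to
    the plain KL term where the reference model is confident.
    """
    # Clamp at 0 to guard against tiny negative values from floating-point rounding.
    weight = max(0.0, 1.0 - normalized_entropy(reference)) ** beta
    return weight * kl_divergence(policy, reference)


# Next-token distributions over a toy 4-token vocabulary.
policy = [0.7, 0.1, 0.1, 0.1]
confident_reference = [0.97, 0.01, 0.01, 0.01]  # reference is sure of the token
uncertain_reference = [0.25, 0.25, 0.25, 0.25]  # reference is maximally unsure

penalty_confident = weighted_kl_penalty(policy, confident_reference)
penalty_uncertain = weighted_kl_penalty(policy, uncertain_reference)
```

Under these assumptions, the policy pays almost no penalty for deviating on the token where the reference is uniform, while the penalty on the confidently predicted token stays close to the plain KL term. Larger values of `beta` would relax the penalty more aggressively as reference uncertainty grows.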

arXiv information

Authors	Jean Vassoyan, Nathanaël Beau, Roman Plaud
Published	2025-02-10 14:56:25+00:00
arXiv page	arxiv_id (pdf)

Source and services used

arxiv.jp, Google
