Dense Reward for Free in Reinforcement Learning from Human Feedback

Summary

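Reinforcement Learning from Human Feedback (RLHF) is credited as the key advance that lets Large Language Models (LLMs) follow instructions effectively and produce useful assistance.
Classically, the LLM generates a completion in response to a query, and a separate reward model then assigns a score to the full completion.
Because generation is auto-regressive, the LLM must take many "actions" (selecting individual tokens) yet receives only a single, sparse reward at the end of the episode, a setup known to be difficult to optimise in traditional reinforcement learning.
This work exploits the fact that the reward model contains more information than its scalar output: as part of the transformer architecture, it computes an attention map over tokens.
These attention weights are used to redistribute the reward along the whole completion, densifying the signal and highlighting the most important tokens, without extra computational cost or additional modelling.
Theoretically, the authors show that this approach is equivalent to potential-based reward shaping, which guarantees that the optimal policy is unchanged.
Empirically, they show that it stabilises training, accelerates learning, and, in practical cases, may lead to better local optima.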
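
The redistribution idea can be illustrated with a minimal sketch. It assumes the reward model's attention weights over the completion tokens have already been extracted as a list of non-negative numbers; the function name `redistribute_reward` and the example values are hypothetical and are not taken from the paper, whose exact weighting scheme is not given on this page.

```python
def redistribute_reward(final_reward, attention_weights):
    """Split one scalar reward across tokens in proportion to attention.

    Assumption: attention_weights holds one non-negative weight per
    completion token. The weights are renormalised so the per-token
    rewards sum back to final_reward.
    """
    total = sum(attention_weights)
    if total <= 0:
        raise ValueError("attention weights must have a positive sum")
    return [final_reward * w / total for w in attention_weights]


# Hypothetical example: a 4-token completion scored 2.0 by the reward model.
weights = [0.1, 0.2, 0.6, 0.1]
dense_rewards = redistribute_reward(2.0, weights)
```

Each token now carries its own reward instead of only the last one, and the token with the largest attention weight receives the largest share. Because the shares sum to the original score, the undiscounted return of the episode is the same as in the sparse setting.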

Abstract (original)

Reinforcement Learning from Human Feedback (RLHF) has been credited as the key advance that has allowed Large Language Models (LLMs) to effectively follow instructions and produce useful assistance. Classically, this involves generating completions from the LLM in response to a query before using a separate reward model to assign a score to the full completion. As an auto-regressive process, the LLM has to take many ‘actions’ (selecting individual tokens) and only receives a single, sparse reward at the end of an episode, a setup that is known to be difficult to optimise in traditional reinforcement learning. In this work we leverage the fact that the reward model contains more information than just its scalar output, in particular, it calculates an attention map over tokens as part of the transformer architecture. We use these attention weights to redistribute the reward along the whole completion, effectively densifying the signal and highlighting the most important tokens, all without incurring extra computational cost or requiring any additional modelling. We demonstrate that, theoretically, this approach is equivalent to potential-based reward shaping, ensuring that the optimal policy remains unchanged. Empirically, we show that it stabilises training, accelerates the rate of learning, and, in practical cases, may lead to better local optima.
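
For readers unfamiliar with the term, the standard definition of potential-based reward shaping is sketched here. The notation (potential function \Phi, discount \gamma, states s_t, actions a_t) is the usual textbook notation and is an assumption, not something stated on this page; how the paper maps attention weights onto a potential is not reproduced here.

```latex
% Shaped reward: the original reward plus a discounted difference of potentials.
r'(s_t, a_t, s_{t+1}) = r(s_t, a_t, s_{t+1}) + \gamma \Phi(s_{t+1}) - \Phi(s_t)

% Summed over an episode of length T, the shaping terms telescope:
\sum_{t=0}^{T-1} \gamma^{t} r'(s_t, a_t, s_{t+1})
  = \sum_{t=0}^{T-1} \gamma^{t} r(s_t, a_t, s_{t+1})
  + \gamma^{T} \Phi(s_T) - \Phi(s_0)
```

The correction depends only on the potential at the first and last states. With the common convention that \Phi is zero at terminal states, it reduces to the constant -\Phi(s_0), which does not depend on the actions taken, so the ranking of policies, and hence the optimal policy, is preserved.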

arXiv information

Authors	Alex J. Chan, Hao Sun, Samuel Holt, Mihaela van der Schaar
Published	2024-02-01 17:10:35+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
