Exploring RL-based LLM Training for Formal Language Tasks with Programmed Rewards

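Summary

Proximal Policy Optimization (PPO) is commonly used in Reinforcement Learning from Human Feedback to align large language models (LLMs) with downstream tasks.
This paper investigates the feasibility of using PPO for direct reinforcement learning (RL) from explicitly programmed reward signals, rather than for indirect learning from human feedback via an intermediary reward model.
We focus on tasks expressed through formal languages, such as mathematics and programming, where an explicit reward function can be programmed to automatically assess the quality of generated outputs.
We apply this approach to a sentiment alignment task, a simple arithmetic task, and a more complex game synthesis task.
The sentiment alignment task replicates prior research and serves to validate our experimental setup.
Our results show that pure RL-based training on the two formal language tasks is challenging, with only limited success even on the simple arithmetic task.
We propose a novel batch-entropy regularization term to aid exploration, although training is not yet entirely stable.
Our findings suggest that, even when an informative reward signal can be expressed programmatically, direct RL training of LLMs may be better suited to relatively minor changes, such as alignment, than to learning new tasks altogether.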

Summary (original)

Proximal Policy Optimization (PPO) is commonly used in Reinforcement Learning from Human Feedback to align large language models (LLMs) with downstream tasks. This paper investigates the feasibility of using PPO for direct reinforcement learning (RL) from explicitly programmed reward signals, as opposed to indirect learning from human feedback via an intermediary reward model. We focus on tasks expressed through formal languages, such as mathematics and programming, where explicit reward functions can be programmed to automatically assess the quality of generated outputs. We apply this approach to a sentiment alignment task, a simple arithmetic task, and a more complex game synthesis task. The sentiment alignment task replicates prior research and serves to validate our experimental setup. Our results show that pure RL-based training for the two formal language tasks is challenging, with success being limited even for the simple arithmetic task. We propose a novel batch-entropy regularization term to aid exploration, although training is not yet entirely stable. Our findings suggest that direct RL training of LLMs may be more suitable for relatively minor changes, such as alignment, than for learning new tasks altogether, even if an informative reward signal can be expressed programmatically.
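
To make the idea of an explicitly programmed reward concrete, here is a minimal sketch of what a reward function for a simple arithmetic task might look like. It is not taken from the paper: the names `evaluate` and `arithmetic_reward`, the restriction to integer `+`, `-`, `*` expressions, and the reward values (1.0 for a correct answer, 0.0 for a wrong one, -1.0 for malformed output) are all assumptions chosen for illustration.

```python
import ast
import operator
import re

# Hypothetical set of supported operators; the paper's task may differ.
_OPS = {ast.Add: operator.add, ast.Sub: operator.sub, ast.Mult: operator.mul}


def evaluate(expr: str) -> int:
    """Safely evaluate an integer expression built from +, -, * and parentheses."""

    def walk(node):
        if isinstance(node, ast.Expression):
            return walk(node.body)
        if (
            isinstance(node, ast.Constant)
            and isinstance(node.value, int)
            and not isinstance(node.value, bool)
        ):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](walk(node.left), walk(node.right))
        if isinstance(node, ast.UnaryOp) and isinstance(node.op, ast.USub):
            return -walk(node.operand)
        raise ValueError("unsupported expression")

    return walk(ast.parse(expr, mode="eval"))


def arithmetic_reward(expr: str, completion: str) -> float:
    """Score a generated completion against the true value of `expr`.

    Assumed reward scheme (illustrative only):
      1.0  -> completion is exactly the correct integer
      0.0  -> completion is a well-formed integer but wrong
     -1.0  -> completion is not a single integer
    """
    match = re.fullmatch(r"\s*(-?\d+)\s*", completion)
    if match is None:
        return -1.0
    return 1.0 if int(match.group(1)) == evaluate(expr) else 0.0
```

In a PPO loop, a function of this kind would take the place of a learned reward model: each sampled completion is scored directly by the program. Separating malformed output from wrong-but-well-formed output is one possible way to make the signal more informative, though whether it helps in practice is an open question here.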

arXiv information

Authors	Alexander G. Padula, Dennis J. N. J. Soemers
Published	2024-10-22 15:59:58+00:00
arXiv page	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google
