Multi-turn Reinforcement Learning from Preference Human Feedback

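Summary

Reinforcement learning from human feedback (RLHF) has become the standard approach for aligning large language models (LLMs) with human preferences, enabling LLMs to perform well across a wide range of tasks.
Existing methods work by emulating preferences at the level of a single decision (turn), which limits them in settings where planning or multi-turn interaction is needed to reach a long-term goal.
This paper addresses the problem by developing new methods for reinforcement learning (RL) from preference feedback between two complete multi-turn conversations.
In the tabular setting, it presents a new mirror-descent-based policy optimization algorithm for the general multi-turn preference-based RL problem and proves that it converges to a Nash equilibrium.
To evaluate performance, the authors create a new environment, Education Dialogue, in which a teacher agent guides a student in learning a random topic, and show that a deep RL variant of the algorithm outperforms RLHF baselines.
Finally, in an environment with explicit rewards, the algorithm recovers the same performance as a reward-based RL baseline even though it relies only on the weaker preference signal.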

Abstract (original)

Reinforcement Learning from Human Feedback (RLHF) has become the standard approach for aligning Large Language Models (LLMs) with human preferences, allowing LLMs to demonstrate remarkable abilities in various tasks. Existing methods work by emulating the preferences at the single decision (turn) level, limiting their capabilities in settings that require planning or multi-turn interactions to achieve a long-term goal. In this paper, we address this issue by developing novel methods for Reinforcement Learning (RL) from preference feedback between two full multi-turn conversations. In the tabular setting, we present a novel mirror-descent-based policy optimization algorithm for the general multi-turn preference-based RL problem, and prove its convergence to Nash equilibrium. To evaluate performance, we create a new environment, Education Dialogue, where a teacher agent guides a student in learning a random topic, and show that a deep RL variant of our algorithm outperforms RLHF baselines. Finally, we show that in an environment with explicit rewards, our algorithm recovers the same performance as a reward-based RL baseline, despite relying solely on a weaker preference signal.

arXiv information

Authors	Lior Shani, Aviv Rosenberg, Asaf Cassel, Oran Lang, Daniele Calandriello, Avital Zipori, Hila Noga, Orgad Keller, Bilal Piot, Idan Szpektor, Avinatan Hassidim, Yossi Matias, Rémi Munos
Published	2024-12-02 12:37:46+00:00
arXiv page	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google
