On Multi-objective Policy Optimization as a Tool for Reinforcement Learning: Case Studies in Offline RL and Finetuning

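Summary

Many of the advances that have improved the robustness and efficiency of deep reinforcement learning (RL) algorithms can be understood, in one form or another, as adding objectives or constraints to the policy optimization step.
This covers a wide range of ideas, including exploration bonuses, entropy regularization, and regularization toward teachers or data priors.
In many cases the task reward and the auxiliary objectives conflict. This paper argues that it is natural to treat such cases as instances of multi-objective (MO) optimization problems.
It shows how this perspective enables the development of novel and more effective RL algorithms.
In particular, it focuses on offline RL and finetuning as case studies and shows that existing approaches can be understood as MO algorithms that rely on linear scalarization.
The authors hypothesize that replacing linear scalarization with a better algorithm can improve performance.
They introduce DiME (Distillation of a Mixture of Experts), a new MORL algorithm that outperforms linear scalarization and can be applied to these non-standard MO problems.
For offline RL, they demonstrate that DiME leads to a simple new algorithm that outperforms state-of-the-art algorithms.
For finetuning, they derive new algorithms that learn to outperform the teacher policy.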

Summary (original)

Many advances that have improved the robustness and efficiency of deep reinforcement learning (RL) algorithms can, in one way or another, be understood as introducing additional objectives or constraints in the policy optimization step. This includes ideas as far ranging as exploration bonuses, entropy regularization, and regularization toward teachers or data priors. Often, the task reward and auxiliary objectives are in conflict, and in this paper we argue that this makes it natural to treat these cases as instances of multi-objective (MO) optimization problems. We demonstrate how this perspective allows us to develop novel and more effective RL algorithms. In particular, we focus on offline RL and finetuning as case studies, and show that existing approaches can be understood as MO algorithms relying on linear scalarization. We hypothesize that replacing linear scalarization with a better algorithm can improve performance. We introduce Distillation of a Mixture of Experts (DiME), a new MORL algorithm that outperforms linear scalarization and can be applied to these non-standard MO problems. We demonstrate that for offline RL, DiME leads to a simple new algorithm that outperforms state-of-the-art. For finetuning, we derive new algorithms that learn to outperform the teacher policy.
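The abstract names linear scalarization without spelling it out. As a minimal sketch of the general technique (not DiME and not the authors' implementation), linear scalarization collapses several objectives into one weighted sum and optimizes that sum. The function names and the three candidate value pairs below are hypothetical, chosen only to illustrate one commonly cited limitation: a Pareto-optimal trade-off that lies in a concave region of the front is never selected by any choice of weights.

```python
from typing import List, Sequence, Tuple

Objectives = Tuple[float, float]


def linear_scalarization(objectives: Sequence[float], weights: Sequence[float]) -> float:
    """Collapse several objective values into one number via a weighted sum."""
    if len(objectives) != len(weights):
        raise ValueError("objectives and weights must have the same length")
    return sum(w * f for w, f in zip(weights, objectives))


def best_by_scalarization(candidates: List[Objectives], weights: Sequence[float]) -> int:
    """Return the index of the candidate with the highest weighted sum."""
    scores = [linear_scalarization(c, weights) for c in candidates]
    return max(range(len(candidates)), key=scores.__getitem__)


def is_pareto_optimal(index: int, candidates: List[Objectives]) -> bool:
    """True if no other candidate is at least as good on every objective
    and strictly better on at least one (both objectives are maximized)."""
    target = candidates[index]
    for j, other in enumerate(candidates):
        if j == index:
            continue
        at_least_as_good = all(o >= t for o, t in zip(other, target))
        strictly_better = any(o > t for o, t in zip(other, target))
        if at_least_as_good and strictly_better:
            return False
    return True


# Hypothetical (task objective, auxiliary objective) values for three
# candidate policies; these numbers are illustrative, not from the paper.
candidates: List[Objectives] = [(1.0, 0.0), (0.45, 0.45), (0.0, 1.0)]

if __name__ == "__main__":
    for w in (0.1, 0.5, 0.9):
        chosen = best_by_scalarization(candidates, (w, 1.0 - w))
        print(f"task weight {w:.1f} -> candidate {chosen}")
    print("middle candidate is Pareto-optimal:", is_pareto_optimal(1, candidates))
```

In this toy setup the balanced candidate is not dominated by either extreme, yet its weighted sum is always 0.45 while one of the extremes scores at least 0.5, so sweeping the weight only ever switches between the two extremes.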

arXiv information

Authors	Abbas Abdolmaleki, Sandy H. Huang, Giulia Vezzani, Bobak Shahriari, Jost Tobias Springenberg, Shruti Mishra, Dhruva TB, Arunkumar Byravan, Konstantinos Bousmalis, Andras Gyorgy, Csaba Szepesvari, Raia Hadsell, Nicolas Heess, Martin Riedmiller
Published	2023-08-01 12:02:58+00:00
arXiv page	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google
