Learning mirror maps in policy mirror descent

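Summary

Policy Mirror Descent (PMD) is a popular framework in reinforcement learning that serves as a unifying view of many algorithms.
Each of these algorithms is obtained by choosing a mirror map, and each comes with finite-time convergence guarantees.
Despite this popularity, PMD's full potential has seen little exploration: most work concentrates on one particular mirror map, the negative entropy, which yields the well-known Natural Policy Gradient (NPG) method.
Existing theoretical studies leave open whether the choice of mirror map significantly affects PMD's efficacy.
In this work, the authors run empirical investigations showing that the conventional mirror map choice (NPG) often gives suboptimal results across several standard benchmark environments.
Using evolutionary strategies, they identify more efficient mirror maps that improve the performance of PMD.
They first focus on a tabular environment, Grid-World, where they relate existing theoretical bounds to the performance of PMD for a few standard mirror maps and for the learned one.
They then show that it is possible to learn a mirror map that outperforms the negative entropy in more complex environments, such as the MinAtar suite.
The results suggest that mirror maps generalize well across different environments, which raises the question of how best to match a mirror map to an environment's structure and characteristics.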
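
To make the role of the mirror map concrete, here is a minimal sketch of a single tabular PMD step at one state, written in plain Python. It is an illustration under stated assumptions, not the authors' implementation: the function names, the toy Q-values, and the step size are all hypothetical. It uses two standard closed forms. With the negative entropy, the Bregman divergence is the KL divergence and the step becomes a multiplicative (softmax-style) update, which is the NPG case. With the squared Euclidean norm, the step becomes a gradient move followed by a Euclidean projection onto the probability simplex.

```python
import math


def pmd_step_negative_entropy(policy, q_values, step_size):
    """One PMD step at a single state with the negative-entropy mirror map.

    Closed form: new(a) is proportional to policy(a) * exp(step_size * q(a)).
    """
    # Subtracting the max Q-value only improves numerical stability;
    # the constant factor cancels when the weights are normalised.
    q_max = max(q_values)
    weights = [p * math.exp(step_size * (q - q_max))
               for p, q in zip(policy, q_values)]
    total = sum(weights)
    return [w / total for w in weights]


def project_onto_simplex(vector):
    """Euclidean projection onto the probability simplex (sort-based)."""
    sorted_desc = sorted(vector, reverse=True)
    cumulative = 0.0
    threshold = 0.0
    for k, value in enumerate(sorted_desc, start=1):
        cumulative += value
        candidate = (cumulative - 1.0) / k
        # The last k for which this holds defines the shift to apply.
        if value - candidate > 0:
            threshold = candidate
    return [max(v - threshold, 0.0) for v in vector]


def pmd_step_squared_l2(policy, q_values, step_size):
    """One PMD step at a single state with the squared-L2 mirror map.

    Closed form: project policy + step_size * q onto the simplex.
    """
    moved = [p + step_size * q for p, q in zip(policy, q_values)]
    return project_onto_simplex(moved)


# Hypothetical toy inputs: uniform policy over 4 actions, one good action.
uniform = [0.25, 0.25, 0.25, 0.25]
toy_q = [1.0, 0.0, 0.0, 0.0]
entropy_policy = pmd_step_negative_entropy(uniform, toy_q, step_size=1.0)
l2_policy = pmd_step_squared_l2(uniform, toy_q, step_size=1.0)
```

On these toy inputs the two mirror maps behave differently from the same starting point: the negative-entropy step shifts probability toward the best action while keeping every action's probability positive, whereas the squared-L2 step can place all mass on it in a single update. Searching over a parametrised family of mirror maps, as the summary describes, would replace these two fixed closed forms with a learned update rule.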

Summary (original)

Policy Mirror Descent (PMD) is a popular framework in reinforcement learning, serving as a unifying perspective that encompasses numerous algorithms. These algorithms are derived through the selection of a mirror map and enjoy finite-time convergence guarantees. Despite its popularity, the exploration of PMD’s full potential is limited, with the majority of research focusing on a particular mirror map — namely, the negative entropy — which gives rise to the renowned Natural Policy Gradient (NPG) method. It remains uncertain from existing theoretical studies whether the choice of mirror map significantly influences PMD’s efficacy. In our work, we conduct empirical investigations to show that the conventional mirror map choice (NPG) often yields less-than-optimal outcomes across several standard benchmark environments. Using evolutionary strategies, we identify more efficient mirror maps that enhance the performance of PMD. We first focus on a tabular environment, i.e. Grid-World, where we relate existing theoretical bounds with the performance of PMD for a few standard mirror maps and the learned one. We then show that it is possible to learn a mirror map that outperforms the negative entropy in more complex environments, such as the MinAtar suite. Our results suggest that mirror maps generalize well across various environments, raising questions about how to best match a mirror map to an environment’s structure and characteristics.

arXiv information

Authors	Carlo Alfano, Sebastian Towers, Silvia Sapora, Chris Lu, Patrick Rebeschini
Published	2024-06-07 16:08:42+00:00
arXiv page	arxiv_id (pdf)

Sources and services used

arxiv.jp, Google
