Symmetric (Optimistic) Natural Policy Gradient for Multi-agent Learning with Parameter Convergence

Abstract

Multi-agent interactions are increasingly important in the context of reinforcement learning, and the theoretical foundations of policy gradient methods have attracted surging research interest. We investigate the global convergence of natural policy gradient (NPG) algorithms in multi-agent learning. We first show that vanilla NPG may not have parameter convergence, i.e., the convergence of the vector that parameterizes the policy, even when the costs are regularized (which enabled strong convergence guarantees in the policy space in the literature). This non-convergence of parameters leads to stability issues in learning, which becomes especially relevant in the function approximation setting, where we can only operate on low-dimensional parameters, instead of the high-dimensional policy. We then propose variants of the NPG algorithm, for several standard multi-agent learning scenarios: two-player zero-sum matrix and Markov games, and multi-player monotone games, with global last-iterate parameter convergence guarantees. We also generalize the results to certain function approximation settings. Note that in our algorithms, the agents take symmetric roles. Our results might also be of independent interest for solving nonconvex-nonconcave minimax optimization problems with certain structures. Simulations are also provided to corroborate our theoretical findings.

arXiv information

Authors	Sarath Pattathil, Kaiqing Zhang, Asuman Ozdaglar
Published	2023-03-20 13:56:49+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
