Offline Multi-Agent Reinforcement Learning via In-Sample Sequential Policy Optimization

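Abstract

Offline multi-agent reinforcement learning (MARL) is an emerging field that aims to learn optimal multi-agent policies from pre-collected datasets.
Compared with the single-agent case, the multi-agent setting involves a large joint state-action space and the coupled behaviors of multiple agents, which makes offline policy optimization more complex.
In this work, we revisit existing offline MARL methods and show that in certain scenarios they can be problematic, leading to uncoordinated behaviors and out-of-distribution (OOD) joint actions.
To address these issues, we propose a new offline MARL algorithm named In-Sample Sequential Policy Optimization (InSPO).
InSPO updates each agent's policy sequentially in an in-sample manner, which not only avoids selecting OOD joint actions but also enhances coordination by carefully accounting for teammates' updated policies.
In addition, by thoroughly exploring actions that have low probability under the behavior policy, InSPO addresses premature convergence to sub-optimal solutions.
Theoretically, we prove that InSPO guarantees monotonic policy improvement and converges to a quantal response equilibrium (QRE).
Experimental results demonstrate the effectiveness of our method compared with current state-of-the-art offline MARL methods.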
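
The abstract gives no algorithmic details, so the following is only a toy illustration of what "sequential, in-sample" updating could look like. It is not the authors' InSPO algorithm. It assumes a one-step cooperative game with two agents, a logged dataset of (joint action, reward) pairs, a behavior policy that factorizes across agents, and a generic behavior-regularized softmax update with a hypothetical temperature `alpha`. All function names are invented for this sketch.

```python
import math
from collections import Counter, defaultdict

def behavior_marginal(dataset, agent):
    """Empirical behavior policy of one agent, estimated from logged actions."""
    counts = Counter(joint[agent] for joint, _ in dataset)
    total = sum(counts.values())
    return {a: c / total for a, c in counts.items()}

def in_sample_q(dataset, agent, policies, behavior):
    """Estimate Q_i(a_i) from logged samples only.

    Teammates' logged actions are reweighted toward their current policies
    with self-normalized importance weights (assumes the behavior policy
    factorizes across agents). No unseen joint action is ever evaluated.
    """
    num, den = defaultdict(float), defaultdict(float)
    for joint, reward in dataset:
        w = 1.0
        for j, a_j in enumerate(joint):
            if j != agent:
                w *= policies[j].get(a_j, 0.0) / behavior[j][a_j]
        num[joint[agent]] += w * reward
        den[joint[agent]] += w
    return {a: num[a] / den[a] for a in den if den[a] > 0.0}

def sequential_update(dataset, n_agents, alpha=1.0, sweeps=20):
    """Update agents one at a time; each sees teammates' latest policies."""
    behavior = [behavior_marginal(dataset, i) for i in range(n_agents)]
    policies = [dict(b) for b in behavior]  # start from the behavior policy
    for _ in range(sweeps):
        for i in range(n_agents):
            q = in_sample_q(dataset, i, policies, behavior)
            # pi_i(a) is proportional to mu_i(a) * exp(Q_i(a) / alpha),
            # so support never leaves the actions seen in the dataset.
            logits = {a: math.log(behavior[i][a]) + q[a] / alpha for a in q}
            m = max(logits.values())
            z = sum(math.exp(v - m) for v in logits.values())
            policies[i] = {a: math.exp(v - m) / z for a, v in logits.items()}
    return policies

# Hypothetical logged data: ((action of agent 0, action of agent 1), reward).
dataset = [((0, 0), 1.0), ((0, 1), 0.0), ((1, 0), 0.0), ((1, 1), 2.0)]
policies = sequential_update(dataset, n_agents=2, alpha=0.5)
```

In this toy game both agents end up strongly preferring action 1, the better of the two coordinated outcomes. Because each update is a softmax response to the teammate's current policy, a fixed point of the loop has the flavor of a quantal response equilibrium, although the guarantees claimed in the abstract apply to the authors' method and not to this sketch.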

Abstract (original)

Offline Multi-Agent Reinforcement Learning (MARL) is an emerging field that aims to learn optimal multi-agent policies from pre-collected datasets. Compared to single-agent case, multi-agent setting involves a large joint state-action space and coupled behaviors of multiple agents, which bring extra complexity to offline policy optimization. In this work, we revisit the existing offline MARL methods and show that in certain scenarios they can be problematic, leading to uncoordinated behaviors and out-of-distribution (OOD) joint actions. To address these issues, we propose a new offline MARL algorithm, named In-Sample Sequential Policy Optimization (InSPO). InSPO sequentially updates each agent’s policy in an in-sample manner, which not only avoids selecting OOD joint actions but also carefully considers teammates’ updated policies to enhance coordination. Additionally, by thoroughly exploring low-probability actions in the behavior policy, InSPO can well address the issue of premature convergence to sub-optimal solutions. Theoretically, we prove InSPO guarantees monotonic policy improvement and converges to quantal response equilibrium (QRE). Experimental results demonstrate the effectiveness of our method compared to current state-of-the-art offline MARL methods.

arXiv information

Authors	Zongkai Liu, Qian Lin, Chao Yu, Xiawei Wu, Yile Liang, Donghui Li, Xuetao Ding
Published	2024-12-10 16:19:08+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
