$TAR^2$: Temporal-Agent Reward Redistribution for Optimal Policy Preservation in Multi-Agent Reinforcement Learning

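Summary

In cooperative multi-agent reinforcement learning (MARL), learning effective policies is difficult when global rewards are sparse and delayed.
This difficulty stems from the need to assign credit across both agents and time steps, a problem that existing methods often fail to address in episodic, long-horizon tasks.
We propose Temporal-Agent Reward Redistribution ($TAR^2$), a new approach that decomposes sparse global rewards into agent-specific, time-step-specific components.
Theoretically, we show that $TAR^2$ (i) aligns with potential-based reward shaping, preserving the same optimal policies as the original environment, and (ii) keeps policy gradient update directions identical to those under the original sparse reward, ensuring unbiased credit signals.
Empirical results on two challenging benchmarks, SMACLite and Google Research Football, show that $TAR^2$ significantly stabilizes and accelerates convergence, outperforming strong baselines such as AREL and STAS in both learning speed and final performance.
These findings establish $TAR^2$ as a principled and practical solution for agent-temporal credit assignment in sparse-reward multi-agent systems.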

Summary (original)

In cooperative multi-agent reinforcement learning (MARL), learning effective policies is challenging when global rewards are sparse and delayed. This difficulty arises from the need to assign credit across both agents and time steps, a problem that existing methods often fail to address in episodic, long-horizon tasks. We propose Temporal-Agent Reward Redistribution $TAR^2$, a novel approach that decomposes sparse global rewards into agent-specific, time-step-specific components, thereby providing more frequent and accurate feedback for policy learning. Theoretically, we show that $TAR^2$ (i) aligns with potential-based reward shaping, preserving the same optimal policies as the original environment, and (ii) maintains policy gradient update directions identical to those under the original sparse reward, ensuring unbiased credit signals. Empirical results on two challenging benchmarks, SMACLite and Google Research Football, demonstrate that $TAR^2$ significantly stabilizes and accelerates convergence, outperforming strong baselines like AREL and STAS in both learning speed and final performance. These findings establish $TAR^2$ as a principled and practical solution for agent-temporal credit assignment in sparse-reward multi-agent systems.
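To make the idea of decomposing one sparse global reward into agent-specific, time-step-specific components concrete, here is a minimal Python sketch. It is not the authors' implementation: the function names `softmax` and `redistribute`, the use of softmax normalization, and the hand-picked scores are all assumptions made for illustration. In a real system the scores would presumably come from a learned model. The sketch only demonstrates one property, namely that the redistributed rewards sum back to the original episodic reward.

```python
import math


def softmax(scores):
    """Normalize arbitrary real scores into positive weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def redistribute(episode_reward, temporal_scores, agent_scores):
    """Split one episodic reward into a [T][N] table of dense rewards.

    temporal_scores: T hypothetical scores, one per time step.
    agent_scores:    T lists of N hypothetical scores, one per agent per step.
    """
    time_weights = softmax(temporal_scores)  # sums to 1 over time steps
    rewards = []
    for w_t, step_scores in zip(time_weights, agent_scores):
        agent_weights = softmax(step_scores)  # sums to 1 over agents
        rewards.append([episode_reward * w_t * w_a for w_a in agent_weights])
    return rewards


# Toy episode (assumed values): 3 time steps, 2 agents, reward 10 at the end.
dense = redistribute(
    10.0,
    temporal_scores=[0.1, 2.0, 0.5],
    agent_scores=[[1.0, 0.0], [0.0, 3.0], [0.5, 0.5]],
)
total = sum(sum(row) for row in dense)  # equals the episodic reward up to rounding
```

Because both sets of weights are normalized to sum to 1, every agent receives a signal at every time step while the episode's total reward is unchanged. This sum-preservation is only a necessary ingredient; the optimal-policy and gradient-direction guarantees stated in the abstract rest on the paper's own analysis, which this sketch does not reproduce.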

arXiv information

Authors	Aditya Kapoor, Kale-ab Tessera, Mayank Baranwal, Harshad Khadilkar, Stefano Albrecht, Mingfei Sun
Published	2025-02-07 12:07:57+00:00
arXiv page	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google
