A Reinforcement Learning Method for Environments with Stochastic Variables: Post-Decision Proximal Policy Optimization with Dual Critic Networks

Abstract

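This paper introduces Post-Decision Proximal Policy Optimization (PDPPO), a new variant of Proximal Policy Optimization (PPO), a leading deep reinforcement learning method.
PDPPO splits the state transition into two steps: a deterministic step that yields the post-decision state, and a stochastic step that leads to the next state.
The approach combines post-decision states with dual critics to reduce the dimensionality of the problem and improve the accuracy of value function estimation.
Lot-sizing, a mixed-integer programming problem, exemplifies such dynamics.
Its objective is to optimize production, delivery fulfillment, and inventory levels under uncertain demand and cost parameters.
The paper evaluates PDPPO across a range of environments and configurations.
Notably, in specific scenarios PDPPO with a dual critic architecture achieves nearly double the maximum reward of vanilla PPO, while requiring fewer episode iterations and learning faster and more consistently across different initializations.
On average, PDPPO outperforms PPO in environments whose state transitions have a stochastic component.
These results support the benefits of using a post-decision state.
Integrating the post-decision state into the value function approximation leads to more informed and efficient learning in high-dimensional, stochastic environments.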
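
The two-step transition described in the abstract can be illustrated with a toy single-item lot-sizing problem. This is only a minimal sketch under stated assumptions: the function names, the uniform demand distribution, and the lost-sales rule are hypothetical choices made for illustration and are not taken from the paper.

```python
import random


def post_decision_state(inventory, production):
    """Deterministic step: inventory right after producing, before demand is known."""
    return inventory + production


def next_state(post_inventory, demand):
    """Stochastic step: realized demand is served from the post-decision inventory.

    Returns (new_inventory, lost_sales). Unmet demand is assumed lost (illustrative).
    """
    served = min(post_inventory, demand)
    return post_inventory - served, demand - served


def step(inventory, production, rng, max_demand=5):
    """One full transition: state -> post-decision state -> next state."""
    post = post_decision_state(inventory, production)
    demand = rng.randint(0, max_demand)  # assumed demand model, uniform on 0..max_demand
    new_inventory, lost = next_state(post, demand)
    return post, demand, new_inventory, lost


rng = random.Random(0)  # fixed seed so the run is reproducible
inventory = 2
for t in range(3):
    post, demand, inventory, lost = step(inventory, production=3, rng=rng)
    print(f"t={t} post-decision={post} demand={demand} next={inventory} lost={lost}")
```

In this sketch the post-decision inventory already contains the deterministic effect of the action, so a value estimate attached to it only has to account for the randomness in demand.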

Abstract (original)

This paper presents Post-Decision Proximal Policy Optimization (PDPPO), a novel variation of the leading deep reinforcement learning method, Proximal Policy Optimization (PPO). The PDPPO state transition process is divided into two steps: a deterministic step resulting in the post-decision state and a stochastic step leading to the next state. Our approach incorporates post-decision states and dual critics to reduce the problem’s dimensionality and enhance the accuracy of value function estimation. Lot-sizing is a mixed integer programming problem for which we exemplify such dynamics. The objective of lot-sizing is to optimize production, delivery fulfillment, and inventory levels in uncertain demand and cost parameters. This paper evaluates the performance of PDPPO across various environments and configurations. Notably, PDPPO with a dual critic architecture achieves nearly double the maximum reward of vanilla PPO in specific scenarios, requiring fewer episode iterations and demonstrating faster and more consistent learning across different initializations. On average, PDPPO outperforms PPO in environments with a stochastic component in the state transition. These results support the benefits of using a post-decision state. Integrating this post-decision state in the value function approximation leads to more informed and efficient learning in high-dimensional and stochastic environments.

arXiv information

Authors	Leonardo Kanashiro Felizardo, Edoardo Fadda, Paolo Brandimarte, Emilio Del-Moral-Hernandez, Mariá Cristina Vasconcelos Nascimento
Published	2025-04-07 14:56:43+00:00
arXiv site	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google
