Towards Principled, Practical Policy Gradient for Bandits and Tabular MDPs

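Summary

We study (stochastic) softmax policy gradient (PG) methods for bandits and tabular Markov decision processes (MDPs).
Although the PG objective is non-concave, recent work has used its smoothness and gradient domination properties to establish convergence to an optimal policy.
However, these theoretical results require setting algorithm parameters according to unknown, problem-dependent quantities (for example, the optimal action or the true reward vector in a bandit problem).
To address this, we borrow ideas from the optimization literature to design practical, principled PG methods for both the exact and stochastic settings.
In the exact setting, we use an Armijo line search to set the step size for softmax PG and show a linear convergence rate.
In the stochastic setting, we use exponentially decreasing step sizes and characterize the convergence rate of the resulting algorithm.
We show that the proposed algorithms offer theoretical guarantees similar to state-of-the-art results without requiring knowledge of oracle-like quantities.
For the multi-armed bandit setting, our techniques yield a theoretically principled PG algorithm that requires no explicit exploration and no knowledge of the reward gap, the reward distributions, or the noise.
Finally, we empirically compare the proposed methods with PG approaches that require oracle knowledge and demonstrate competitive performance.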
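
To make the exact-setting idea concrete, here is a minimal sketch of softmax PG on a bandit with known mean rewards, where each step size is chosen by a backtracking Armijo line search. The abstract does not give the precise update, so the sufficient-increase condition, the constants `eta_max`, `c`, and `beta`, the reset of the step size to `eta_max` at every iteration, and all function names are illustrative assumptions, not the authors' implementation.

```python
import math

def softmax(theta):
    """Numerically stable softmax over a list of logits."""
    m = max(theta)
    exps = [math.exp(t - m) for t in theta]
    z = sum(exps)
    return [e / z for e in exps]

def expected_reward(theta, rewards):
    """Bandit objective f(theta) = <softmax(theta), rewards>."""
    return sum(p * r for p, r in zip(softmax(theta), rewards))

def gradient(theta, rewards):
    """Exact gradient: df/dtheta_a = pi_a * (r_a - <pi, r>)."""
    pi = softmax(theta)
    value = sum(p * r for p, r in zip(pi, rewards))
    return [p * (r - value) for p, r in zip(pi, rewards)]

def armijo_step(theta, rewards, eta_max=1000.0, c=0.5, beta=0.5,
                max_backtracks=60):
    """One ascent step with backtracking Armijo line search.

    Assumed acceptance test (standard Armijo form for maximization):
        f(theta + eta * g) >= f(theta) + c * eta * ||g||^2
    """
    grad = gradient(theta, rewards)
    grad_sq = sum(g * g for g in grad)
    f0 = expected_reward(theta, rewards)
    eta = eta_max  # assumption: restart from eta_max at every iteration
    for _ in range(max_backtracks):
        candidate = [t + eta * g for t, g in zip(theta, grad)]
        if expected_reward(candidate, rewards) >= f0 + c * eta * grad_sq:
            return candidate, eta
        eta *= beta  # shrink the step and try again
    return list(theta), 0.0  # safety guard: no acceptable step found

def run_exact_pg(rewards, iterations=100):
    """Run the sketch from a uniform policy; return logits and objective history."""
    theta = [0.0] * len(rewards)
    history = [expected_reward(theta, rewards)]
    for _ in range(iterations):
        theta, _ = armijo_step(theta, rewards)
        history.append(expected_reward(theta, rewards))
    return theta, history
```

Calling `run_exact_pg([0.2, 0.5, 0.9])` returns the final logits and the sequence of objective values. The line search only needs function and gradient evaluations, so no smoothness constant, reward gap, or optimal action has to be supplied in advance.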

Summary (original)

We consider (stochastic) softmax policy gradient (PG) methods for bandits and tabular Markov decision processes (MDPs). While the PG objective is non-concave, recent research has used the objective’s smoothness and gradient domination properties to achieve convergence to an optimal policy. However, these theoretical results require setting the algorithm parameters according to unknown problem-dependent quantities (e.g. the optimal action or the true reward vector in a bandit problem). To address this issue, we borrow ideas from the optimization literature to design practical, principled PG methods in both the exact and stochastic settings. In the exact setting, we employ an Armijo line-search to set the step-size for softmax PG and demonstrate a linear convergence rate. In the stochastic setting, we utilize exponentially decreasing step-sizes, and characterize the convergence rate of the resulting algorithm. We show that the proposed algorithm offers similar theoretical guarantees as the state-of-the-art results, but does not require the knowledge of oracle-like quantities. For the multi-armed bandit setting, our techniques result in a theoretically-principled PG algorithm that does not require explicit exploration, the knowledge of the reward gap, the reward distributions, or the noise. Finally, we empirically compare the proposed methods to PG approaches that require oracle knowledge, and demonstrate competitive performance.
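
For the stochastic setting, the following is a rough sketch of softmax PG on a multi-armed bandit with exponentially decreasing step-sizes. The abstract does not specify the schedule, so the form eta_t = eta0 * alpha^t with alpha = (beta / T)^(1/T) is an assumption borrowed from common practice in the optimization literature. The REINFORCE-style gradient estimate, the Gaussian reward noise, and every name and default value below are likewise hypothetical choices for illustration.

```python
import math
import random

def softmax(theta):
    """Numerically stable softmax over a list of logits."""
    m = max(theta)
    exps = [math.exp(t - m) for t in theta]
    z = sum(exps)
    return [e / z for e in exps]

def exponential_step_sizes(eta0, horizon, beta=1.0):
    """Assumed schedule: eta_t = eta0 * alpha**t for t = 1..T,
    with alpha = (beta / T)**(1 / T), so the last step is eta0 * beta / T."""
    alpha = (beta / horizon) ** (1.0 / horizon)
    return [eta0 * alpha ** (t + 1) for t in range(horizon)]

def stochastic_softmax_pg(mean_rewards, horizon, eta0=1.0,
                          noise_std=0.1, seed=0):
    """Stochastic softmax PG with no explicit exploration bonus.

    Each round samples one arm from the current policy, observes a noisy
    reward, and applies the unbiased estimate R * (e_a - pi) of the gradient.
    """
    rng = random.Random(seed)
    num_arms = len(mean_rewards)
    theta = [0.0] * num_arms
    for eta in exponential_step_sizes(eta0, horizon):
        pi = softmax(theta)
        action = rng.choices(range(num_arms), weights=pi)[0]
        # Hypothetical noise model: Gaussian noise around the mean reward.
        reward = mean_rewards[action] + rng.gauss(0.0, noise_std)
        for b in range(num_arms):
            indicator = 1.0 if b == action else 0.0
            theta[b] += eta * reward * (indicator - pi[b])
    return softmax(theta)
```

A call such as `stochastic_softmax_pg([0.2, 0.5, 0.9], horizon=1000)` returns the final policy as a list of arm probabilities. The schedule depends only on the horizon and the initial step-size, so under these assumptions nothing about the reward gap or the noise level is needed to set it.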

arXiv information

Authors	Michael Lu, Matin Aghaei, Anant Raj, Sharan Vaswani
Published	2024-07-09 16:59:42+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
