Unichain and Aperiodicity are Sufficient for Asymptotic Optimality of Average-Reward Restless Bandits

Summary

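We consider the infinite-horizon, average-reward restless bandit problem in discrete time.
We propose a new class of policies designed to drive a progressively larger subset of arms toward the optimal distribution.
We show that our policies are asymptotically optimal, with an $O(1/\sqrt{N})$ optimality gap for the $N$-armed problem, provided that the single-armed MDP is unichain and aperiodic under the optimal single-armed policy.
Our approach differs from most existing work, which focuses either on index or priority policies that rely on the Uniform Global Attractor Property (UGAP) to guarantee convergence to the optimum, or on a recently developed simulation-based policy that requires a Synchronization Assumption (SA).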

Abstract (original)

We consider the infinite-horizon, average-reward restless bandit problem in discrete time. We propose a new class of policies that are designed to drive a progressively larger subset of arms toward the optimal distribution. We show that our policies are asymptotically optimal with an $O(1/\sqrt{N})$ optimality gap for an $N$-armed problem, provided that the single-armed MDP is unichain and aperiodic under the optimal single-armed policy. Our approach departs from most existing work that focuses on index or priority policies, which rely on the Uniform Global Attractor Property (UGAP) to guarantee convergence to the optimum, or a recently developed simulation-based policy, which requires a Synchronization Assumption (SA).
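
The unichain and aperiodicity condition above can be made concrete for a small example. The following is a minimal sketch, not the paper's method: it assumes the Markov chain induced by a fixed single-armed policy is given as a row-stochastic matrix `P` (a hypothetical input; the function name `is_unichain_aperiodic` is also illustrative). It uses the fact that a finite chain has a single, aperiodic recurrent class exactly when some state is reachable from every state in the same number of steps `k`, that is, when the support of `P^k` has a fully positive column for some `k`.

```python
def is_unichain_aperiodic(P):
    """Return True if the finite Markov chain with row-stochastic matrix P
    has exactly one recurrent class and that class is aperiodic.

    Illustrative check only; P is assumed to be the chain induced by a
    fixed policy, given as a list of rows.
    """
    n = len(P)
    # Boolean support of the one-step transition matrix.
    step = [[P[i][j] > 0 for j in range(n)] for i in range(n)]
    reach = step  # support of P^k, starting with k = 1
    seen = set()
    while True:
        # A fully positive column j means every state reaches j in exactly k steps.
        if any(all(reach[i][j] for i in range(n)) for j in range(n)):
            return True
        key = tuple(tuple(row) for row in reach)
        if key in seen:
            # The support patterns have started to repeat without ever
            # producing a positive column, so none will appear later.
            return False
        seen.add(key)
        # Boolean matrix product: support of P^(k+1).
        reach = [
            [any(reach[i][m] and step[m][j] for m in range(n)) for j in range(n)]
            for i in range(n)
        ]


# A two-state chain with all transitions possible satisfies the condition.
print(is_unichain_aperiodic([[0.5, 0.5], [0.2, 0.8]]))
# A deterministic two-cycle is unichain but periodic, so the check fails.
print(is_unichain_aperiodic([[0.0, 1.0], [1.0, 0.0]]))
```

Once a positive column appears it persists in all later powers, so stopping at the first repeated support pattern is safe. The set of visited patterns can grow large for big state spaces, so this sketch is intended for small chains only.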

arXiv information

Authors	Yige Hong, Qiaomin Xie, Yudong Chen, Weina Wang
Published	2024-06-13 17:43:45+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
