Follow-the-Perturbed-Leader Approaches Best-of-Both-Worlds for the m-Set Semi-Bandit Problems

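Summary

We consider the $m$-set semi-bandit, a common case of the combinatorial semi-bandit problem, in which the learner selects exactly $m$ arms out of $d$ arms in total.
In the adversarial setting, the best known regret bound, $\mathcal{O}(\sqrt{nmd})$ for time horizon $n$, is achieved by the well-known Follow-the-Regularized-Leader (FTRL) policy.
However, FTRL must explicitly compute the arm-selection probabilities by solving an optimization problem at each time step and then sample according to them.
The Follow-the-Perturbed-Leader (FTPL) policy avoids this computation.
This paper shows that FTPL with a Fréchet perturbation also enjoys the near-optimal regret bound $\mathcal{O}(\sqrt{nmd\log(d)})$ in the adversarial setting and approaches best-of-both-worlds regret bounds.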

Summary (original)

We consider a common case of the combinatorial semi-bandit problem, the $m$-set semi-bandit, where the learner selects exactly $m$ arms from the total $d$ arms. In the adversarial setting, the best regret bound, known to be $\mathcal{O}(\sqrt{nmd})$ for time horizon $n$, is achieved by the well-known Follow-the-Regularized-Leader (FTRL) policy. However, this requires explicitly computing the arm-selection probabilities by solving an optimization problem at each time step and sampling according to them. This problem can be avoided by the Follow-the-Perturbed-Leader (FTPL) policy, which simply pulls the $m$ arms with the $m$ smallest (estimated) losses after random perturbation. In this paper, we show that FTPL with a Fréchet perturbation also enjoys the near-optimal regret bound $\mathcal{O}(\sqrt{nmd\log(d)})$ in the adversarial setting and approaches best-of-both-worlds regret bounds, i.e., achieves a logarithmic regret in the stochastic setting.
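The selection step described in the abstract (perturb the estimated losses, then pull the $m$ arms with the smallest perturbed values) can be sketched roughly as follows. This is only an illustrative sketch, not the authors' implementation: the function names, the Fréchet shape parameter of 2, the learning rate `eta`, and the exact form `loss - perturbation / eta` are assumptions made for the example, and the loss-estimation step is omitted entirely.

```python
import numpy as np


def sample_frechet(rng, shape, size):
    """Draw Frechet samples with CDF F(x) = exp(-x**(-shape)) for x > 0.

    If E ~ Exp(1), then E**(-1/shape) has this distribution.
    """
    e = rng.standard_exponential(size)
    # Guard against an exact zero draw, which would produce inf.
    e = np.maximum(e, np.finfo(float).tiny)
    return e ** (-1.0 / shape)


def ftpl_select(est_cum_loss, m, eta, rng, shape=2.0):
    """Pick m arms with the smallest perturbed estimated cumulative loss.

    est_cum_loss : 1-D array of estimated cumulative losses, one per arm
    m            : number of arms to pull (1 <= m <= number of arms)
    eta          : learning rate (assumed; larger eta means less perturbation)
    shape        : Frechet shape parameter (assumed value for illustration)
    """
    est_cum_loss = np.asarray(est_cum_loss, dtype=float)
    perturbation = sample_frechet(rng, shape, est_cum_loss.shape[0])
    perturbed = est_cum_loss - perturbation / eta
    # argpartition places the m smallest entries in the first m positions.
    return np.argpartition(perturbed, m - 1)[:m]


rng = np.random.default_rng(0)
losses = np.array([5.0, 1.0, 4.0, 0.0, 3.0])
chosen = ftpl_select(losses, m=2, eta=0.5, rng=rng)
# With a very large eta the perturbation is negligible, so the choice
# coincides with the two smallest estimated losses (arms 1 and 3).
greedy = ftpl_select(losses, m=2, eta=1e12, rng=rng)
```

Note that no arm-selection probabilities are computed anywhere in this step; one draw of the perturbation and a partial sort suffice, which is the computational advantage over FTRL that the abstract points to.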

arXiv information

Authors	Jingxin Zhan, Yuchen Xin, Zhihua Zhang
Published	2025-04-22 15:16:03+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
