The Power of Populations in Decentralized Bandits

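Summary

We study a cooperative multi-agent bandit setting in the distributed GOSSIP model. In each round, each of the $n$ agents chooses an action from a common set, observes the reward for that action, and then exchanges information with a single randomly chosen neighbor.
This exchange informs the agent's policy in the next round.
Under the constraint that each agent has only constant memory, we introduce and analyze several families of fully decentralized local algorithms for this setting.
We highlight a connection between the global evolution of such decentralized algorithms and a new class of "zero-sum" multiplicative weights update methods, and we develop a general framework for analyzing the population-level regret of these natural protocols.
Using this framework, we derive sublinear regret bounds for both the stationary and the adversarial reward settings.
Moreover, assuming that the reward distributions are generated from a stochastic gradient oracle, we show that these simple local algorithms can approximately optimize convex functions over the simplex.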
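
The round structure described above (choose an action, observe a reward, exchange information with one random neighbor) can be made concrete with a small simulation. The following is a minimal sketch and not the authors' algorithm: the adoption rule, the complete communication graph, the Bernoulli rewards, and every name and parameter (`simulate_gossip_bandit`, `n_agents`, `means`, `rounds`) are assumptions chosen for illustration. Each agent stores a single action index, which is one simple way to respect a constant-memory constraint.

```python
import random


def simulate_gossip_bandit(n_agents=500, means=(0.3, 0.5, 0.8), rounds=300, seed=0):
    """Toy gossip-bandit simulation (illustrative assumption, not the paper's protocol).

    Returns a list with one entry per round; each entry holds the fraction
    of agents currently storing each action.
    """
    rng = random.Random(seed)
    k = len(means)
    # Constant memory: every agent stores exactly one action index.
    actions = [rng.randrange(k) for _ in range(n_agents)]
    history = []
    for _ in range(rounds):
        # Each agent plays its stored action and observes a Bernoulli reward.
        rewards = [1 if rng.random() < means[a] else 0 for a in actions]
        new_actions = list(actions)
        for i in range(n_agents):
            # Sample one uniformly random *other* agent (complete graph assumed).
            j = rng.randrange(n_agents - 1)
            if j >= i:
                j += 1
            # Hypothetical adoption rule: copy the neighbor's action only if
            # the neighbor observed a strictly higher reward this round.
            if rewards[j] > rewards[i]:
                new_actions[i] = actions[j]
        # All agents update simultaneously at the end of the round.
        actions = new_actions
        history.append([actions.count(a) / n_agents for a in range(k)])
    return history


if __name__ == "__main__":
    hist = simulate_gossip_bandit()
    print("action shares after the final round:", hist[-1])
```

Under this assumed rule, an agent moves from action $a$ to action $b$ only when its own reward is 0 and its neighbor's is 1, so in expectation the share of each action grows in proportion to how far its mean reward exceeds the population average. Whatever one action gains, the others lose, which gives an informal sense of what a "zero-sum" update on the action shares can look like.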

Abstract (original)

We study a cooperative multi-agent bandit setting in the distributed GOSSIP model: in every round, each of $n$ agents chooses an action from a common set, observes the action’s corresponding reward, and subsequently exchanges information with a single randomly chosen neighbor, which informs its policy in the next round. We introduce and analyze several families of fully-decentralized local algorithms in this setting under the constraint that each agent has only constant memory. We highlight a connection between the global evolution of such decentralized algorithms and a new class of ‘zero-sum’ multiplicative weights update methods, and we develop a general framework for analyzing the population-level regret of these natural protocols. Using this framework, we derive sublinear regret bounds for both stationary and adversarial reward settings. Moreover, we show that these simple local algorithms can approximately optimize convex functions over the simplex, assuming that the reward distributions are generated from a stochastic gradient oracle.

arXiv information

Authors	John Lazarsfeld, Dan Alistarh
Published	2024-02-01 18:54:21+00:00
arXiv page	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google
