Multi-armed Bandit Learning on a Graph

Summary

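The multi-armed bandit (MAB) problem is a simple yet powerful framework that has been widely studied in the context of decision-making under uncertainty.
In many real-world applications, such as robotics, selecting an arm corresponds to a physical action that restricts which arms (actions) are available next.
Motivated by this, we study an extension of MAB called the graph bandit, in which an agent travels over a graph to maximize the reward collected from different nodes.
The graph defines the agent's freedom in choosing the next available node at each step.
We assume the graph structure is fully known, but the reward distributions are unknown.
Building on an offline graph-based planning algorithm and the principle of optimism, we design a learning algorithm, G-UCB, that balances long-term exploration and exploitation.
We show that the proposed algorithm achieves $O(\sqrt{|S|T\log(T)}+D|S|\log T)$ learning regret, where $|S|$ is the number of nodes and $D$ is the diameter of the graph, which matches the theoretical lower bound $\Omega(\sqrt{|S|T})$ up to logarithmic factors.
To our knowledge, this result is among the first tight regret bounds for non-episodic, undiscounted learning problems with known deterministic transitions.
Numerical experiments confirm that the algorithm outperforms several benchmarks.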
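
The idea of combining an offline planner with optimism can be illustrated with a small sketch. This is not the authors' G-UCB implementation: the function names (`shortest_path`, `graph_ucb`), the Bernoulli rewards, the confidence radius, the rule of staying at the target until its visit count doubles, and the assumption that the agent may remain at a node (a self-loop at every node) are all illustrative choices made here. Breadth-first search stands in for the offline planning step.

```python
import math
import random
from collections import deque


def shortest_path(adj, start, goal):
    """Breadth-first search on a connected graph; returns nodes from start to goal."""
    parent = {start: None}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            break
        for nxt in adj[node]:
            if nxt not in parent:
                parent[nxt] = node
                queue.append(nxt)
    path, node = [], goal
    while node is not None:
        path.append(node)
        node = parent[node]
    return path[::-1]


def graph_ucb(adj, means, horizon, start=0, seed=0):
    """Illustrative optimism-based loop on a graph (assumed details, see lead-in)."""
    rng = random.Random(seed)
    counts = {s: 0 for s in adj}
    sums = {s: 0.0 for s in adj}
    current, t, total = start, 0, 0.0

    def visit(node):
        # Move to `node` and collect a Bernoulli reward there (assumed reward model).
        nonlocal current, t, total
        reward = 1.0 if rng.random() < means[node] else 0.0
        counts[node] += 1
        sums[node] += reward
        total += reward
        current = node
        t += 1

    def index(s):
        # Optimistic index: empirical mean plus a confidence radius.
        if counts[s] == 0:
            return float("inf")
        return sums[s] / counts[s] + math.sqrt(2 * math.log(max(t, 2)) / counts[s])

    visit(start)
    while t < horizon:
        target = max(adj, key=index)
        # Walk to the most optimistic node, collecting rewards on the way.
        for node in shortest_path(adj, current, target)[1:]:
            if t >= horizon:
                break
            visit(node)
        # Stay at the target until its visit count doubles (assumed self-loop).
        stay_until = 2 * counts[target]
        while t < horizon and counts[target] < stay_until:
            visit(target)
    return total, counts
```

For example, on a five-node line graph `adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}` with hypothetical mean rewards per node, `graph_ucb(adj, means, horizon=2000)` returns the total collected reward and the per-node visit counts. Re-planning only after a doubling keeps the number of traversals across the graph small, which is one plausible way to keep the travel cost lower order.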

Abstract (original)

The multi-armed bandit (MAB) problem is a simple yet powerful framework that has been extensively studied in the context of decision-making under uncertainty. In many real-world applications, such as robotic applications, selecting an arm corresponds to a physical action that constrains the choices of the next available arms (actions). Motivated by this, we study an extension of MAB called the graph bandit, where an agent travels over a graph to maximize the reward collected from different nodes. The graph defines the agent’s freedom in selecting the next available nodes at each step. We assume the graph structure is fully available, but the reward distributions are unknown. Built upon an offline graph-based planning algorithm and the principle of optimism, we design a learning algorithm, G-UCB, that balances long-term exploration-exploitation using the principle of optimism. We show that our proposed algorithm achieves $O(\sqrt{|S|T\log(T)}+D|S|\log T)$ learning regret, where $|S|$ is the number of nodes and $D$ is the diameter of the graph, which matches the theoretical lower bound $\Omega(\sqrt{|S|T})$ up to logarithmic factors. To our knowledge, this result is among the first tight regret bounds in non-episodic, un-discounted learning problems with known deterministic transitions. Numerical experiments confirm that our algorithm outperforms several benchmarks.
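
One way to read "up to logarithmic factors" is to compare the two stated bounds directly. The following is elementary algebra on the expressions quoted above, not a step taken from the paper's proof: the leading term of the upper bound exceeds the lower bound by a factor of $\sqrt{\log T}$, and the additive term $D|S|\log T$ is no larger than the leading term once $T$ is large relative to $D^{2}|S|\log T$.

```latex
\[
\frac{\sqrt{|S|\,T\log T}}{\sqrt{|S|\,T}} = \sqrt{\log T},
\qquad
D|S|\log T \le \sqrt{|S|\,T\log T}
\iff
T \ge D^{2}|S|\log T .
\]
```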

arXiv information

Authors	Tianpeng Zhang, Kasper Johansson, Na Li
Published	2023-03-20 17:06:12+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
