Piecewise-Stationary Combinatorial Semi-Bandit with Causally Related Rewards

Summary

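We study the piecewise-stationary combinatorial semi-bandit problem with causally related rewards.
In this nonstationary environment, the reward generation process changes because of variations in the base arms' distributions, in the causal relationships between rewards, or in both.
In such an environment, an optimal decision-maker must track both sources of change and adapt accordingly.
The problem becomes harder in the combinatorial semi-bandit setting, where the decision-maker observes only the outcomes of the selected bundle of arms.
The core of our proposed policy is the Upper Confidence Bound (UCB) algorithm.
We assume that the agent relies on an adaptive approach to overcome this challenge.
More specifically, it employs a change-point detector based on the Generalized Likelihood Ratio (GLR) test.
In addition, we introduce the notion of group restart as a new alternative restarting strategy for decision-making in structured environments.
Finally, our algorithm integrates a mechanism that tracks variations in the underlying graph structure, which captures the causal relationships between the rewards in the bandit setting.
Theoretically, we establish a regret upper bound that reflects how the numbers of structural and distributional changes affect performance.
Numerical experiments in real-world scenarios demonstrate the applicability and superior performance of our proposal compared with state-of-the-art benchmarks.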
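
To make the change-point detection step concrete, the following is a minimal sketch of a GLR test on the reward stream of a single base arm. It is not the authors' implementation: the summary does not state the reward model or the detection threshold, so the sketch assumes Bernoulli-style rewards in [0, 1] and a hand-picked constant threshold. The names `bernoulli_kl`, `glr_statistic`, and `change_detected` are hypothetical and used only for illustration.

```python
import math


def bernoulli_kl(p, q, eps=1e-12):
    """KL divergence between Bernoulli(p) and Bernoulli(q), clipped for stability."""
    p = min(max(p, eps), 1 - eps)
    q = min(max(q, eps), 1 - eps)
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))


def glr_statistic(samples):
    """Largest GLR statistic over all split points of the sample stream.

    For each split s, the stream is divided into samples[:s] and samples[s:],
    and the fit of two separate means is compared with the fit of one pooled mean.
    """
    n = len(samples)
    if n < 2:
        return 0.0
    total = sum(samples)
    mean_all = total / n
    best = 0.0
    prefix = 0.0
    for s in range(1, n):
        prefix += samples[s - 1]
        mean_left = prefix / s
        mean_right = (total - prefix) / (n - s)
        stat = (s * bernoulli_kl(mean_left, mean_all)
                + (n - s) * bernoulli_kl(mean_right, mean_all))
        best = max(best, stat)
    return best


def change_detected(samples, threshold):
    """Flag a change when the GLR statistic exceeds the threshold (assumed constant here)."""
    return glr_statistic(samples) > threshold


if __name__ == "__main__":
    stationary = [0, 1] * 20          # mean stays near 0.5 throughout
    shifted = [0] * 20 + [1] * 20     # mean jumps from 0 to 1 halfway
    print(change_detected(stationary, threshold=5.0))
    print(change_detected(shifted, threshold=5.0))
```

In a restart-based policy, a positive detection would typically trigger a reset of the affected arm statistics before the UCB indices are recomputed; how the group restart described above selects which arms to reset is not specified in this summary, so it is left out of the sketch.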

Summary (original)

We study the piecewise stationary combinatorial semi-bandit problem with causally related rewards. In our nonstationary environment, variations in the base arms’ distributions, causal relationships between rewards, or both, change the reward generation process. In such an environment, an optimal decision-maker must follow both sources of change and adapt accordingly. The problem becomes aggravated in the combinatorial semi-bandit setting, where the decision-maker only observes the outcome of the selected bundle of arms. The core of our proposed policy is the Upper Confidence Bound (UCB) algorithm. We assume the agent relies on an adaptive approach to overcome the challenge. More specifically, it employs a change-point detector based on the Generalized Likelihood Ratio (GLR) test. Besides, we introduce the notion of group restart as a new alternative restarting strategy in the decision making process in structured environments. Finally, our algorithm integrates a mechanism to trace the variations of the underlying graph structure, which captures the causal relationships between the rewards in the bandit setting. Theoretically, we establish a regret upper bound that reflects the effects of the number of structural- and distribution changes on the performance. The outcome of our numerical experiments in real-world scenarios exhibits applicability and superior performance of our proposal compared to the state-of-the-art benchmarks.

arXiv information

Authors	Behzad Nourani-Koliji, Steven Bilaj, Amir Rezaei Balef, Setareh Maghsudi
Published	2023-07-26 12:06:13+00:00
arXiv site	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google
