Provably Efficient Learning in Partially Observable Contextual Bandit

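Summary

This paper investigates transfer learning in partially observable contextual bandits, where an agent has only limited knowledge from other agents and partial information about hidden confounders.
We first recast the task as one of identifying, or partially identifying, the causal effects between actions and rewards through optimization problems.
To solve these optimization problems, we discretize the original functional constraints on the unknown distributions into linear constraints, sample compatible causal models by sequentially solving linear programs, and obtain causal bounds that account for estimation error.
Our sampling algorithms provide desirable convergence results for suitable sampling distributions.
We then show how causal bounds can be applied to improve classical bandit algorithms and how they affect the regret with respect to the sizes of the action set and the function space.
Notably, in tasks with function approximation, which can handle general context distributions, our method improves the order of dependence on the function space size compared with the previous literature.
We formally prove that our causally enhanced algorithms outperform classical bandit algorithms and achieve orders-of-magnitude faster convergence rates.
Finally, we run simulations that demonstrate the efficiency of our strategy compared with current state-of-the-art methods.
This research has the potential to improve the performance of contextual bandit agents in real-world applications where data is scarce and costly to obtain.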
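
As a rough illustration of what a causal bound is, the sketch below computes the simplest partial-identification bounds for a binary reward when only the observational joint distribution P(A, Y) is known and a confounder is hidden. This is not the paper's sampling algorithm. It is the closed-form solution of the smallest linear program of this kind (often called natural or Manski bounds), and the function name `natural_bounds` and the example distribution are assumptions made for illustration.

```python
from typing import Dict, Tuple


def natural_bounds(
    p_obs: Dict[Tuple[int, int], float], action: int
) -> Tuple[float, float]:
    """Bound E[Y | do(A=action)] for a binary reward Y in {0, 1}.

    p_obs maps (a, y) to the observational probability P(A=a, Y=y).
    When A != action, the counterfactual reward is unobserved, so it can
    lie anywhere in [0, 1]; that slack is what widens the interval.
    """
    total = sum(p_obs.values())
    if abs(total - 1.0) > 1e-9:
        raise ValueError("p_obs must sum to 1")

    # P(A = action): mass on which the reward under `action` is observed.
    p_action = sum(p for (a, _), p in p_obs.items() if a == action)
    # P(A = action, Y = 1): reward mass that is known for certain.
    p_action_and_reward = p_obs.get((action, 1), 0.0)

    lower = p_action_and_reward                     # unobserved rewards all 0
    upper = p_action_and_reward + (1.0 - p_action)  # unobserved rewards all 1
    return lower, upper


# Hypothetical observational distribution over two actions (assumption).
p_obs = {(0, 0): 0.1, (0, 1): 0.4, (1, 0): 0.3, (1, 1): 0.2}
bounds = {a: natural_bounds(p_obs, a) for a in (0, 1)}
```

With partial knowledge of the confounder, as the summary describes, extra linear constraints would tighten these intervals; the sketch covers only the case with no such knowledge.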

Summary (original)

In this paper, we investigate transfer learning in partially observable contextual bandits, where agents have limited knowledge from other agents and partial information about hidden confounders. We first convert the problem to identifying or partially identifying causal effects between actions and rewards through optimization problems. To solve these optimization problems, we discretize the original functional constraints of unknown distributions into linear constraints, and sample compatible causal models by sequentially solving linear programs to obtain causal bounds that account for estimation error. Our sampling algorithms provide desirable convergence results for suitable sampling distributions. We then show how causal bounds can be applied to improve classical bandit algorithms and how they affect the regret with respect to the size of action sets and function spaces. Notably, in tasks with function approximation, which allows us to handle general context distributions, our method improves the order dependence on function space size compared with the previous literature. We formally prove that our causally enhanced algorithms outperform classical bandit algorithms and achieve orders of magnitude faster convergence rates. Finally, we perform simulations that demonstrate the efficiency of our strategy compared to the current state-of-the-art methods. This research has the potential to enhance the performance of contextual bandit agents in real-world applications where data is scarce and costly to obtain.
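
One simple way such bounds can help a classical bandit algorithm is sketched below: arms whose upper causal bound falls below the best lower bound are discarded, and the UCB index of each remaining arm is clipped to its bounds. This is an assumed, simplified multi-armed variant for illustration, not the paper's contextual algorithm; `prune_arms`, `clipped_ucb_index`, `run_bandit`, and all numeric values are hypothetical.

```python
import math
import random
from typing import Dict, List, Tuple

Bounds = Dict[int, Tuple[float, float]]  # arm -> (lower, upper) causal bound


def prune_arms(bounds: Bounds) -> List[int]:
    """Keep arms whose upper bound is at least the best lower bound."""
    best_lower = max(lo for lo, _ in bounds.values())
    return sorted(a for a, (_, hi) in bounds.items() if hi >= best_lower)


def clipped_ucb_index(mean: float, pulls: int, t: int, lo: float, hi: float) -> float:
    """UCB1 index clipped to the causal interval [lo, hi]."""
    if pulls == 0:
        return hi  # untried arm: use the causal upper bound as the optimistic value
    ucb = mean + math.sqrt(2.0 * math.log(t) / pulls)
    return min(max(ucb, lo), hi)


def run_bandit(
    true_means: Dict[int, float], bounds: Bounds, horizon: int, seed: int = 0
) -> Dict[int, int]:
    """Play Bernoulli arms for `horizon` rounds; return pull counts per kept arm."""
    rng = random.Random(seed)
    arms = prune_arms(bounds)
    pulls = {a: 0 for a in arms}
    sums = {a: 0.0 for a in arms}

    for t in range(1, horizon + 1):
        def index(a: int) -> float:
            mean = sums[a] / pulls[a] if pulls[a] else 0.0
            return clipped_ucb_index(mean, pulls[a], t, *bounds[a])

        arm = max(arms, key=index)
        reward = 1.0 if rng.random() < true_means[arm] else 0.0
        pulls[arm] += 1
        sums[arm] += reward
    return pulls


# Hypothetical problem instance; each bound contains the true mean (assumption).
true_means = {0: 0.7, 1: 0.3, 2: 0.5}
bounds = {0: (0.6, 0.9), 1: (0.1, 0.4), 2: (0.3, 0.8)}
pull_counts = run_bandit(true_means, bounds, horizon=1000)
```

With these illustrative numbers, arm 1 is never played because its upper bound (0.4) is below arm 0's lower bound (0.6), which is the sense in which causal bounds can shrink the effective action set.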

arXiv information

Authors	Xueping Gong, Jiheng Zhang
Published	2023-08-07 13:24:50+00:00
arXiv site	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google
