Linear Contextual Bandits with Hybrid Payoff: Revisited

Summary

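We study the linear contextual bandit problem in the hybrid reward setting.
In this setting, every arm's reward model contains arm-specific parameters in addition to parameters shared across the reward models of all arms.
This setting can be reduced to two closely related settings, (a) Shared: no arm-specific parameters, and (b) Disjoint: only arm-specific parameters, which makes it possible to apply two popular state-of-the-art algorithms, $\texttt{LinUCB}$ and $\texttt{DisLinUCB}$ (Algorithm 1 in (Li et al. 2010)).
When the arm features are stochastic and satisfy a popular diversity condition, we provide new regret analyses for both algorithms that significantly improve on their known regret guarantees.
Our novel analysis critically exploits the hybrid reward structure and the diversity condition.
Moreover, we introduce a new algorithm, $\texttt{HyLinUCB}$, that crucially modifies $\texttt{LinUCB}$ (using a new exploration coefficient) to account for sparsity in the hybrid setting.
Under the same diversity assumptions, we prove that $\texttt{HyLinUCB}$ also incurs only $O(\sqrt{T})$ regret over $T$ rounds.
We perform extensive experiments on synthetic and real-world datasets that demonstrate the strong empirical performance of $\texttt{HyLinUCB}$.
When the number of arm-specific parameters is much larger than the number of shared parameters, we observe that $\texttt{DisLinUCB}$ incurs the lowest regret.
In this case, the regret of $\texttt{HyLinUCB}$ is the second best and extremely competitive with $\texttt{DisLinUCB}$.
In all other situations, including our real-world dataset, $\texttt{HyLinUCB}$ has significantly lower regret than $\texttt{LinUCB}$, $\texttt{DisLinUCB}$, and the other SOTA baselines we considered.
We also observe empirically that the regret of $\texttt{HyLinUCB}$ grows much more slowly with the number of arms than that of the baselines, making it suitable even for very large action spaces.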

Summary (original)

We study the Linear Contextual Bandit problem in the hybrid reward setting. In this setting every arm’s reward model contains arm-specific parameters in addition to parameters shared across the reward models of all the arms. We can reduce this setting to two closely related settings (a) Shared – no arm-specific parameters, and (b) Disjoint – only arm-specific parameters, enabling the application of two popular state-of-the-art algorithms – $\texttt{LinUCB}$ and $\texttt{DisLinUCB}$ (Algorithm 1 in (Li et al. 2010)). When the arm features are stochastic and satisfy a popular diversity condition, we provide new regret analyses for both algorithms, significantly improving on the known regret guarantees of these algorithms. Our novel analysis critically exploits the hybrid reward structure and the diversity condition. Moreover, we introduce a new algorithm $\texttt{HyLinUCB}$ that crucially modifies $\texttt{LinUCB}$ (using a new exploration coefficient) to account for sparsity in the hybrid setting. Under the same diversity assumptions, we prove that $\texttt{HyLinUCB}$ also incurs only $O(\sqrt{T})$ regret for $T$ rounds. We perform extensive experiments on synthetic and real-world datasets demonstrating strong empirical performance of $\texttt{HyLinUCB}$. When the number of arm-specific parameters is much larger than the number of shared parameters, we observe that $\texttt{DisLinUCB}$ incurs the lowest regret. In this case, the regret of $\texttt{HyLinUCB}$ is the second best and extremely competitive with $\texttt{DisLinUCB}$. In all other situations, including our real-world dataset, $\texttt{HyLinUCB}$ has significantly lower regret than $\texttt{LinUCB}$, $\texttt{DisLinUCB}$ and other SOTA baselines we considered. We also empirically observe that the regret of $\texttt{HyLinUCB}$ grows much more slowly with the number of arms compared to baselines, making it suitable even for very large action spaces.
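
The two reductions mentioned in the abstract can be illustrated with a small sketch. It assumes the hybrid mean reward has the usual form "shared features times shared parameters plus arm features times arm-specific parameters"; the function names and the block-embedding layout below are hypothetical choices made for illustration and are not taken from the paper. Note that the long shared-model feature vector is zero outside one arm's block, which is one plausible reading of the sparsity the abstract refers to.

```python
# Illustrative sketch only: all names and the embedding layout are assumptions.

def dot(u, v):
    """Inner product of two equal-length lists."""
    return sum(a * b for a, b in zip(u, v))

def hybrid_mean_reward(x, z, theta, beta_arm):
    """Assumed hybrid model: <x, theta> (shared part) + <z, beta_arm> (arm part)."""
    return dot(x, theta) + dot(z, beta_arm)

def shared_features(x, z, arm, num_arms):
    """Shared reduction: place z in the block of `arm`, zeros in all other blocks."""
    d2 = len(z)
    blocks = [0.0] * (num_arms * d2)
    blocks[arm * d2:(arm + 1) * d2] = z
    return list(x) + blocks

def shared_parameter(theta, betas):
    """Single parameter vector [theta; beta_1; ...; beta_K] for the shared model."""
    flat = list(theta)
    for beta in betas:
        flat.extend(beta)
    return flat

def disjoint_features(x, z):
    """Disjoint reduction: every arm sees the concatenated feature [x; z]."""
    return list(x) + list(z)

def disjoint_parameter(theta, beta_arm):
    """Per-arm parameter [theta; beta_arm]; theta is copied into every arm."""
    return list(theta) + list(beta_arm)

# Toy instance: 2 shared dimensions, 1 arm-specific dimension, 3 arms.
theta = [1.0, 2.0]
betas = [[0.5], [-1.0], [3.0]]
x, z, arm = [1.0, 1.0], [2.0], 2

r_hybrid = hybrid_mean_reward(x, z, theta, betas[arm])
r_shared = dot(shared_features(x, z, arm, len(betas)), shared_parameter(theta, betas))
r_disjoint = dot(disjoint_features(x, z), disjoint_parameter(theta, betas[arm]))
```

In this toy instance all three quantities coincide, which is the sense in which the hybrid setting reduces to either a single shared linear model (of much higher dimension) or a set of per-arm linear models (which ignore that part of each arm's parameter is common to all arms).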

arXiv information

Authors	Nirjhar Das, Gaurav Sinha
Published	2024-06-14 15:41:21+00:00
arXiv page	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
