LC-Tsalis-INF: Generalized Best-of-Both-Worlds Linear Contextual Bandits

Summary

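This study considers the linear contextual bandit problem with independent and identically distributed (i.i.d.) contexts.
For this problem, existing studies have proposed Best-of-Both-Worlds (BoBW) algorithms whose regret satisfies $O(\log^2(T))$ for the number of rounds $T$ in a stochastic regime where the suboptimality gap is lower-bounded by a positive constant, while satisfying $O(\sqrt{T})$ in an adversarial regime.
However, the dependence on $T$ leaves room for improvement, and the suboptimality-gap assumption can be relaxed.
To address this, this study proposes an algorithm whose regret satisfies $O(\log(T))$ when the suboptimality gap is lower-bounded.
Furthermore, we introduce a margin condition, a milder assumption on the suboptimality gap.
This condition characterizes the problem difficulty linked to the suboptimality gap through a parameter $\beta \in (0, \infty]$. We then show that the algorithm's regret satisfies $O\left(\left\{\log(T)\right\}^{\frac{1+\beta}{2+\beta}}T^{\frac{1}{2+\beta}}\right)$.
Here, $\beta = \infty$ corresponds to the case in the existing studies where the suboptimality gap has a lower bound, and in that case our regret satisfies $O(\log(T))$.
The proposed algorithm is based on Follow-The-Regularized-Leader with the Tsallis entropy and is referred to as the $\alpha$-Linear-Contextual (LC)-Tsallis-INF.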
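As a quick check of how the margin-condition rate behaves, the block below takes limits of its two exponents. $R_T$ is notation assumed here for the regret after $T$ rounds; the $\beta = 1$ line and the $\beta \to 0$ limit are illustrative evaluations of the stated formula, not results claimed in the paper.

```latex
% Stated regret bound under the margin condition, \beta \in (0, \infty]
R_T = O\left(\{\log T\}^{\frac{1+\beta}{2+\beta}} T^{\frac{1}{2+\beta}}\right)

% \beta \to \infty (suboptimality gap lower-bounded by a positive constant):
\frac{1+\beta}{2+\beta} \to 1, \qquad \frac{1}{2+\beta} \to 0
\quad\Longrightarrow\quad R_T = O(\log T)

% Example, \beta = 1:
R_T = O\left(\{\log T\}^{2/3} T^{1/3}\right)

% \beta \to 0 (a limiting case only, since \beta > 0 is required):
\frac{1+\beta}{2+\beta} \to \frac{1}{2}, \qquad \frac{1}{2+\beta} \to \frac{1}{2}
\quad\Longrightarrow\quad R_T \to O\left(\sqrt{T \log T}\right)
```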

Summary (original)

This study considers the linear contextual bandit problem with independent and identically distributed (i.i.d.) contexts. In this problem, existing studies have proposed Best-of-Both-Worlds (BoBW) algorithms whose regrets satisfy $O(\log^2(T))$ for the number of rounds $T$ in a stochastic regime with a suboptimality gap lower-bounded by a positive constant, while satisfying $O(\sqrt{T})$ in an adversarial regime. However, the dependency on $T$ has room for improvement, and the suboptimality-gap assumption can be relaxed. For this issue, this study proposes an algorithm whose regret satisfies $O(\log(T))$ in the setting when the suboptimality gap is lower-bounded. Furthermore, we introduce a margin condition, a milder assumption on the suboptimality gap. That condition characterizes the problem difficulty linked to the suboptimality gap using a parameter $\beta \in (0, \infty]$. We then show that the algorithm’s regret satisfies $O\left(\left\{\log(T)\right\}^{\frac{1+\beta}{2+\beta}}T^{\frac{1}{2+\beta}}\right)$. Here, $\beta= \infty$ corresponds to the case in the existing studies where a lower bound exists in the suboptimality gap, and our regret satisfies $O(\log(T))$ in that case. Our proposed algorithm is based on the Follow-The-Regularized-Leader with the Tsallis entropy and referred to as the $\alpha$-Linear-Contextual (LC)-Tsallis-INF.
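The abstract names Follow-The-Regularized-Leader with the Tsallis entropy as the basis of the algorithm. As a rough illustration of that building block only, the sketch below implements the standard non-contextual 1/2-Tsallis-INF for a K-armed bandit with losses in [0, 1]. It is not the authors' $\alpha$-LC-Tsallis-INF, which additionally handles linear contexts and a general $\alpha$. The function names `tsallis_inf_weights` and `run_tsallis_inf`, the $1/\sqrt{t}$ learning-rate schedule, and the bisection solver for the normalizer are hypothetical choices made for this example.

```python
import math
import random


def tsallis_inf_weights(cum_loss_est, eta, tol=1e-12):
    """FTRL step with the 1/2-Tsallis regularizer -(4/eta) * sum_i sqrt(w_i).

    Stationarity of the Lagrangian gives the closed form
        w_i = 4 / (eta * (L_i - x))**2,
    where the normalizer x < min_i L_i is chosen so that sum_i w_i = 1.
    """
    k = len(cum_loss_est)
    l_min = min(cum_loss_est)
    # The weight sum increases in x, and this bracket contains the root:
    # at lo every w_i <= 1/k (sum <= 1); at hi the best arm has w_i = 1 (sum >= 1).
    lo = l_min - 2.0 * math.sqrt(k) / eta
    hi = l_min - 2.0 / eta

    def total(x):
        return sum(4.0 / (eta * (l - x)) ** 2 for l in cum_loss_est)

    for _ in range(200):  # bisection on the normalizer x
        mid = 0.5 * (lo + hi)
        if total(mid) > 1.0:
            hi = mid
        else:
            lo = mid
        if hi - lo < tol:
            break
    x = 0.5 * (lo + hi)
    w = [4.0 / (eta * (l - x)) ** 2 for l in cum_loss_est]
    s = sum(w)
    return [wi / s for wi in w]  # renormalize away the small bisection error


def run_tsallis_inf(loss_fn, n_arms, horizon, seed=0):
    """Play `horizon` rounds; loss_fn(t, arm) must return a loss in [0, 1]."""
    rng = random.Random(seed)
    cum_loss_est = [0.0] * n_arms
    total_loss = 0.0
    w = [1.0 / n_arms] * n_arms
    for t in range(1, horizon + 1):
        eta = 1.0 / math.sqrt(t)  # assumed 1/sqrt(t) learning-rate schedule
        w = tsallis_inf_weights(cum_loss_est, eta)
        arm = rng.choices(range(n_arms), weights=w)[0]
        loss = loss_fn(t, arm)
        total_loss += loss
        # Importance-weighted (unbiased) loss estimate: only the played arm changes.
        cum_loss_est[arm] += loss / w[arm]
    return total_loss, w
```

For example, `run_tsallis_inf(lambda t, a: 0.0 if a == 0 else 1.0, n_arms=2, horizon=2000)` should end with most of the probability mass on arm 0. Bisection is used rather than Newton's method because the weight sum is monotone in $x$ and a valid bracket is available in closed form, which makes convergence easy to guarantee.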

arXiv information

Authors	Masahiro Kato, Shinji Ito
Published	2024-03-05 18:59:47+00:00
arXiv page	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
