Nearly Minimax Optimal Regret for Multinomial Logistic Bandit

Abstract

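This paper studies the contextual multinomial logit (MNL) bandit problem, in which a learning agent sequentially selects assortments based on contextual information and user feedback follows an MNL choice model.
A significant gap has remained between the lower and upper regret bounds, particularly with respect to the feature dimension $d$ and the maximum assortment size $K$.
Moreover, the reward structures assumed by these bounds differ, which complicates the pursuit of optimality.
Under uniform rewards, where all items have the same expected reward, the authors establish a regret lower bound of $\Omega(d\sqrt{\smash[b]{T/K}})$ and propose a constant-time algorithm, OFU-MNL+, that achieves a matching upper bound of $\tilde{O}(d\sqrt{\smash[b]{T/K}})$.
Under non-uniform rewards, they prove a lower bound of $\Omega(d\sqrt{T})$ and an upper bound of $\tilde{O}(d\sqrt{T})$, which is also achieved by OFU-MNL+.
Their empirical studies support these theoretical findings.
To the authors' knowledge, this is the first work in the contextual MNL bandit literature to prove minimax optimality in either the uniform or the non-uniform reward setting and to propose a computationally efficient algorithm that achieves this optimality up to logarithmic factors.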

Abstract (original)

In this paper, we study the contextual multinomial logit (MNL) bandit problem in which a learning agent sequentially selects an assortment based on contextual information, and user feedback follows an MNL choice model. There has been a significant discrepancy between lower and upper regret bounds, particularly regarding the feature dimension $d$ and the maximum assortment size $K$. Additionally, the variation in reward structures between these bounds complicates the quest for optimality. Under uniform rewards, where all items have the same expected reward, we establish a regret lower bound of $\Omega(d\sqrt{\smash[b]{T/K}})$ and propose a constant-time algorithm, OFU-MNL+, that achieves a matching upper bound of $\tilde{O}(d\sqrt{\smash[b]{T/K}})$. Under non-uniform rewards, we prove a lower bound of $\Omega(d\sqrt{T})$ and an upper bound of $\tilde{O}(d\sqrt{T})$, also achievable by OFU-MNL+. Our empirical studies support these theoretical findings. To the best of our knowledge, this is the first work in the contextual MNL bandit literature to prove minimax optimality — for either uniform or non-uniform reward setting — and to propose a computationally efficient algorithm that achieves this optimality up to logarithmic factors.
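The abstract refers to an MNL choice model without writing it out. The sketch below illustrates one common form of that model, in which a user either picks one item from the offered assortment or picks nothing (an outside option with utility 0). The function names, the linear utility `features @ theta`, and the outside-option normalization are assumptions made for illustration; they are not taken from the abstract and should not be read as the authors' implementation of OFU-MNL+.

```python
import numpy as np

def mnl_choice_probs(features, theta):
    """Choice probabilities for one assortment under an assumed MNL model.

    features: (k, d) array, one row per offered item (hypothetical layout).
    theta:    (d,) parameter vector (hypothetical name).
    Returns (p_outside, p_items), where p_items has shape (k,).
    """
    utilities = features @ theta
    # Shift by the largest utility (including the outside option's 0)
    # so that the exponentials cannot overflow.
    shift = max(0.0, float(utilities.max()))
    exp_items = np.exp(utilities - shift)
    exp_outside = np.exp(-shift)
    denom = exp_outside + exp_items.sum()
    return exp_outside / denom, exp_items / denom

def expected_revenue(features, rewards, theta):
    """Expected reward of an assortment: sum of p(item) * reward(item)."""
    _, p_items = mnl_choice_probs(features, theta)
    return float(p_items @ rewards)
```

With all rewards equal to 1 (one instance of the uniform-reward setting), `expected_revenue` in this sketch equals one minus the outside-option probability, so adding an item to the assortment never lowers it.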

arXiv information

Authors	Joongkyu Lee, Min-hwan Oh
Published	2024-06-04 15:34:55+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
