Preferences Evolve And So Should Your Bandits: Bandits with Evolving States for Online Platforms

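Summary

We propose a model for learning with bandit feedback that accounts for deterministically evolving, unobservable states, which we call Bandits with Deterministically Evolving States ($B$-$DES$).
The workhorse applications of our model are learning for recommendation systems and learning for online ads.
In both cases, the reward the algorithm obtains at each round is a function of the short-term reward of the chosen action and of how "healthy" the system is (i.e., as measured by its state).
For example, in recommendation systems, the reward the platform obtains from a user's engagement with a particular type of content depends not only on the inherent features of that content, but also on how the user's preferences have evolved as a result of interacting with other types of content on the platform.
Our general model accounts for the rate $\lambda \in [0,1]$ at which the state evolves (e.g., how fast a user's preferences shift as a result of previous content consumption) and encompasses standard multi-armed bandits as a special case.
The goal of the algorithm is to minimize a notion of regret against the best fixed sequence of arms pulled, which is significantly harder to attain than the standard benchmark of the best fixed action in hindsight.
We present online learning algorithms for any possible value of the evolution rate $\lambda$ and show the robustness of our results to various model misspecifications.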

Summary (original)

We propose a model for learning with bandit feedback while accounting for deterministically evolving and unobservable states that we call Bandits with Deterministically Evolving States ($B$-$DES$). The workhorse applications of our model are learning for recommendation systems and learning for online ads. In both cases, the reward that the algorithm obtains at each round is a function of the short-term reward of the action chosen and how ‘healthy’ the system is (i.e., as measured by its state). For example, in recommendation systems, the reward that the platform obtains from a user’s engagement with a particular type of content depends not only on the inherent features of the specific content, but also on how the user’s preferences have evolved as a result of interacting with other types of content on the platform. Our general model accounts for the different rates $\lambda \in [0,1]$ at which the state evolves (e.g., how fast a user’s preferences shift as a result of previous content consumption) and encompasses standard multi-armed bandits as a special case. The goal of the algorithm is to minimize a notion of regret against the best fixed sequence of arms pulled, which is significantly harder to attain compared to the standard benchmark of the best fixed action in hindsight. We present online learning algorithms for any possible value of the evolution rate $\lambda$ and we show the robustness of our results to various model misspecifications.
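
The abstract does not spell out the state-update rule or the exact reward form, so the following Python sketch is only an illustration under stated assumptions: the reward at each round is taken to be the hidden state multiplied by the arm's mean short-term reward, and the state is assumed to move toward an arm-specific target at rate $\lambda$. The names `expected_total_reward`, `mean_rewards`, and `state_targets` are hypothetical and do not come from the paper.

```python
from typing import Sequence


def expected_total_reward(
    arms: Sequence[int],
    mean_rewards: Sequence[float],
    state_targets: Sequence[float],
    lam: float,
    initial_state: float = 1.0,
) -> float:
    """Expected cumulative reward of an arm sequence under an ASSUMED model.

    Assumptions (not taken from the abstract):
      * reward per round = hidden state * mean short-term reward of the arm
      * state update: state <- (1 - lam) * state + lam * state_targets[arm]
    """
    state = initial_state  # in the bandit setting the learner never sees this
    total = 0.0
    for arm in arms:
        total += state * mean_rewards[arm]
        # Deterministic evolution: no randomness enters the state update.
        state = (1.0 - lam) * state + lam * state_targets[arm]
    return total


# Hypothetical two-arm instance: arm 0 pays well but drains the state,
# arm 1 pays little but restores it.
MEAN_REWARDS = [0.9, 0.1]
STATE_TARGETS = [0.0, 1.0]
HORIZON = 10

always_0 = expected_total_reward([0] * HORIZON, MEAN_REWARDS, STATE_TARGETS, lam=0.5)
always_1 = expected_total_reward([1] * HORIZON, MEAN_REWARDS, STATE_TARGETS, lam=0.5)
alternating = expected_total_reward(
    [0, 1] * (HORIZON // 2), MEAN_REWARDS, STATE_TARGETS, lam=0.5
)

print(f"always arm 0: {always_0:.3f}")
print(f"always arm 1: {always_1:.3f}")
print(f"alternating:  {alternating:.3f}")
```

In this assumed instance the alternating sequence earns more than either fixed arm, which illustrates why a benchmark over sequences of arms is more demanding than the best fixed action in hindsight. With `lam=0.0` the state never moves from its initial value, so the sketch reduces to a standard multi-armed bandit, in line with the special case mentioned in the abstract.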

arXiv information

Authors	Khashayar Khosravi, Renato Paes Leme, Chara Podimata, Apostolis Tsorvantzis
Published	2025-01-28 18:01:19+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
