Nearly Optimal Algorithms for Contextual Dueling Bandits from Adversarial Feedback

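Summary

Learning from human feedback plays an important role in aligning generative models such as large language models (LLMs).
However, the effectiveness of this approach can be undermined by adversaries, who may intentionally provide misleading preferences to steer the output in an undesirable or harmful direction.
To tackle this challenge, we study a specific model within this problem domain: contextual dueling bandits with adversarial feedback, where the true preference label can be flipped by an adversary.
We propose robust contextual dueling bandits (RCDB), an algorithm based on uncertainty-weighted maximum likelihood estimation.
Our algorithm achieves an $\tilde O(d\sqrt{T}/\kappa + dC/\kappa)$ regret bound, where $T$ is the number of rounds, $d$ is the dimension of the context, $\kappa$ is the lower bound of the derivative of the link function, and $0 \le C \le T$ is the total amount of adversarial feedback.
We also prove a lower bound showing that this regret bound is nearly optimal.
Ours is the first work to achieve nearly minimax optimal regret for dueling bandits in the presence of adversarial preference feedback.
Additionally, for the sigmoid link function, we develop a novel algorithm that accounts for the effect of local derivatives in the maximum likelihood estimation (MLE) analysis through a refined method for estimating the link function's derivative.
This method eliminates the $\kappa$ dependence in the leading term with respect to $T$, which reduces the exponential dependence on the parameter radius $B$ to a polynomial dependence.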
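
To make the idea of uncertainty-weighted maximum likelihood estimation more concrete, here is a minimal sketch, not the authors' RCDB implementation. It assumes a sigmoid link, feature differences `phis` between the two dueled arms, and weights of the form min(1, alpha / ||phi||) measured in the inverse weighted covariance, a common choice for uncertainty weighting that the abstract itself does not spell out. The function names and the parameters `alpha` and `lam` are hypothetical.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def uncertainty_weights(phis, alpha=1.0, lam=1.0):
    """Down-weight rounds whose feature difference is still poorly explored.

    Assumed form (not taken from the source): w_i = min(1, alpha / ||phi_i||),
    with the norm taken in the inverse of the running weighted covariance.
    """
    d = phis.shape[1]
    cov = lam * np.eye(d)
    weights = np.empty(len(phis))
    for i, phi in enumerate(phis):
        norm = np.sqrt(phi @ np.linalg.solve(cov, phi))
        weights[i] = 1.0 if norm == 0 else min(1.0, alpha / norm)
        cov += weights[i] * np.outer(phi, phi)
    return weights

def weighted_mle(phis, labels, weights, lam=1.0, iters=50):
    """Newton's method on the weighted, L2-regularized logistic loss."""
    d = phis.shape[1]
    theta = np.zeros(d)
    for _ in range(iters):
        p = sigmoid(phis @ theta)
        grad = lam * theta + phis.T @ (weights * (p - labels))
        curv = weights * p * (1.0 - p)
        hess = lam * np.eye(d) + (phis * curv[:, None]).T @ phis
        theta = theta - np.linalg.solve(hess, grad)
    return theta
```

The intuition behind this design is that a flipped label does the most damage in directions the learner has barely explored, so capping the weight of high-uncertainty rounds limits how far any single adversarial preference can move the estimate.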

Summary (original)

Learning from human feedback plays an important role in aligning generative models, such as large language models (LLM). However, the effectiveness of this approach can be influenced by adversaries, who may intentionally provide misleading preferences to manipulate the output in an undesirable or harmful direction. To tackle this challenge, we study a specific model within this problem domain–contextual dueling bandits with adversarial feedback, where the true preference label can be flipped by an adversary. We propose an algorithm namely robust contextual dueling bandits (RCDB), which is based on uncertainty-weighted maximum likelihood estimation. Our algorithm achieves an $\tilde O(d\sqrt{T}/\kappa+dC/\kappa)$ regret bound, where $T$ is the number of rounds, $d$ is the dimension of the context, $\kappa$ is the lower bound of the derivative of the link function, and $ 0 \le C \le T$ is the total number of adversarial feedback. We also prove a lower bound to show that our regret bound is nearly optimal, both in scenarios with and without ($C=0$) adversarial feedback. Our work is the first to achieve nearly minimax optimal regret for dueling bandits in the presence of adversarial preference feedback. Additionally, for the sigmoid link function, we develop a novel algorithm that takes into account the effect of local derivatives into maximum likelihood estimation (MLE) analysis through a refined method for estimating the link function’s derivative. This method helps us to eliminate the $\kappa$ dependence in the leading term with respect to $T$, which reduces the exponential dependence on the parameter radius $B$ to a polynomial dependence.
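
The link between $\kappa$ and the parameter radius $B$ in the final sentence can be illustrated with a short calculation. This is a sketch under an assumption not stated in the abstract: that the argument of the sigmoid link is bounded in absolute value by some $R$ that grows linearly with $B$ (the exact constant would depend on how the features are normalized).

```latex
\sigma(x) = \frac{1}{1+e^{-x}}, \qquad
\sigma'(x) = \sigma(x)\bigl(1-\sigma(x)\bigr) = \frac{e^{-|x|}}{\bigl(1+e^{-|x|}\bigr)^{2}}.
% \sigma' is even and decreasing in |x|, so its minimum over [-R, R] is attained at |x| = R:
\kappa = \min_{|x|\le R}\sigma'(x) = \frac{e^{-R}}{\bigl(1+e^{-R}\bigr)^{2}},
\qquad
\frac{e^{-R}}{4} \le \kappa \le e^{-R}
\quad\Longrightarrow\quad
e^{R} \le \frac{1}{\kappa} \le 4e^{R}.
```

Under this assumption, any $1/\kappa$ factor costs a factor exponential in $R$, and hence in $B$, which is why removing it from the leading $\sqrt{T}$ term turns an exponential dependence on $B$ into a polynomial one.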

arXiv information

Authors	Qiwei Di, Jiafan He, Quanquan Gu
Published	2025-02-28 18:56:33+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
