Beyond Numeric Rewards: In-Context Dueling Bandits with LLM Agents

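Summary

In-context reinforcement learning (ICRL) is a frontier paradigm for solving reinforcement learning (RL) problems in the era of foundation models.
ICRL capabilities have been demonstrated in transformers through task-specific training, but the potential of out-of-the-box large language models (LLMs) remains largely unexplored.
This paper investigates whether LLMs can generalize across domains to perform ICRL on the Dueling Bandits (DB) problem, a stateless preference-based RL setting.
The top-performing LLMs show a notable zero-shot capacity for relative decision-making, which translates into low short-term weak regret across all DB environment instances because they quickly include the best arm in duels.
However, in terms of strong regret, an optimality gap remains between LLMs and classic DB algorithms.
LLMs struggle to converge and to exploit consistently, even when explicitly prompted to do so, and they are sensitive to prompt variations.
To bridge this gap, we propose an agentic flow framework, LLM with Enhanced Algorithmic Dueling (LEAD), which integrates off-the-shelf DB algorithm support with LLM agents through fine-grained adaptive interplay.
We show that LEAD has theoretical guarantees, inherited from classic DB algorithms, on both weak and strong regret.
We validate its efficacy and robustness even with noisy and adversarial prompts.
The design of such an agentic framework sheds light on how to enhance the trustworthiness of general-purpose LLMs applied to in-context decision-making tasks.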

Summary (original)

In-Context Reinforcement Learning (ICRL) is a frontier paradigm to solve Reinforcement Learning (RL) problems in the foundation model era. While ICRL capabilities have been demonstrated in transformers through task-specific training, the potential of Large Language Models (LLMs) out-of-the-box remains largely unexplored. This paper investigates whether LLMs can generalize cross-domain to perform ICRL under the problem of Dueling Bandits (DB), a stateless preference-based RL setting. We find that the top-performing LLMs exhibit a notable zero-shot capacity for relative decision-making, which translates to low short-term weak regret across all DB environment instances by quickly including the best arm in duels. However, an optimality gap still exists between LLMs and classic DB algorithms in terms of strong regret. LLMs struggle to converge and consistently exploit even when explicitly prompted to do so, and are sensitive to prompt variations. To bridge this gap, we propose an agentic flow framework: LLM with Enhanced Algorithmic Dueling (LEAD), which integrates off-the-shelf DB algorithm support with LLM agents through fine-grained adaptive interplay. We show that LEAD has theoretical guarantees inherited from classic DB algorithms on both weak and strong regret. We validate its efficacy and robustness even with noisy and adversarial prompts. The design of such an agentic framework sheds light on how to enhance the trustworthiness of general-purpose LLMs generalized to in-context decision-making tasks.
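
The distinction between weak and strong regret that the abstract relies on can be illustrated with a minimal sketch. It assumes the definitions commonly used in the dueling-bandits literature (per-round strong regret as the average gap of both dueled arms to the best arm, weak regret as the smaller of the two gaps); the paper's exact normalization may differ. The preference matrix `PREF`, the index `BEST`, and the helper `regrets` are hypothetical names chosen for illustration only.

```python
# Hypothetical 3-arm preference matrix: PREF[a][b] = P(arm a beats arm b).
# Arm 0 beats every other arm with probability > 0.5, so it is the best
# (Condorcet winner) arm in this toy instance.
PREF = [
    [0.5, 0.7, 0.9],
    [0.3, 0.5, 0.6],
    [0.1, 0.4, 0.5],
]
BEST = 0


def regrets(pref, best, i, j):
    """Return per-round (strong, weak) regret for dueling arms i and j.

    Assumed definitions (common in the literature, not taken from the paper):
      gap(k) = P(best beats k) - 0.5
      strong = average of the two gaps
      weak   = smaller of the two gaps
    """
    gap_i = pref[best][i] - 0.5
    gap_j = pref[best][j] - 0.5
    strong = (gap_i + gap_j) / 2
    weak = min(gap_i, gap_j)
    return strong, weak


if __name__ == "__main__":
    # Best arm vs. worst arm: weak regret is zero, strong regret is not.
    print("duel (0, 2):", regrets(PREF, BEST, 0, 2))
    # Neither arm is the best one: both regrets are positive.
    print("duel (1, 2):", regrets(PREF, BEST, 1, 2))
    # Best arm dueled against itself: both regrets are zero.
    print("duel (0, 0):", regrets(PREF, BEST, 0, 0))
```

Under these assumed definitions, a policy that merely includes the best arm in each duel already achieves zero weak regret, whereas zero strong regret requires converging to dueling the best arm against itself. This is consistent with the abstract's observation that LLMs do well on weak regret but lag on strong regret.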

arXiv information

Authors	Fanzeng Xia, Hao Liu, Yisong Yue, Tongxin Li
Published	2025-06-09 14:56:27+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
