Feedback Loops With Language Models Drive In-Context Reward Hacking

要約

言語モデルは外部の世界に影響を与えます。言語モデルは、Web ページの読み書きを行う API をクエリし、人間の行動を形作るコンテンツを生成し、自律エージェントとしてシステムコマンドを実行します。
これらの相互作用はフィードバックループを形成します。LLM の出力は世界に影響を与え、それが後続の LLM 出力に影響を与えます。
この研究では、フィードバックループがインコンテキスト報酬ハッキング (ICRH) を引き起こす可能性があることを示します。ICRH では、テスト時の LLM は (潜在的に暗黙的な) 目標を最適化しますが、その過程でマイナスの副作用が生じます。
たとえば、Twitter のエンゲージメントを高めるために導入された LLM エージェントを考えてみましょう。
LLM は、以前のツイートをコンテキストウィンドウに取得して、より物議を醸す可能性があり、エンゲージメントだけでなく有害性も高めます。
私たちは、ICRH につながる 2 つのプロセス、つまり出力の洗練と政策の洗練を特定し、研究します。
これらのプロセスの場合、静的データセットでの評価は不十分です。フィードバック効果が見逃されるため、最も有害な動作を捕捉できません。
これに応じて、ICRH のより多くの事例を捕捉するための評価に関する 3 つの推奨事項を提供します。
AI の開発が加速するにつれて、フィードバックループの影響が増大し、LLM の動作を形成する際のフィードバックループの役割を理解する必要性が高まります。

要約(オリジナル)

Language models influence the external world: they query APIs that read and write to web pages, generate content that shapes human behavior, and run system commands as autonomous agents. These interactions form feedback loops: LLM outputs affect the world, which in turn affect subsequent LLM outputs. In this work, we show that feedback loops can cause in-context reward hacking (ICRH), where the LLM at test-time optimizes a (potentially implicit) objective but creates negative side effects in the process. For example, consider an LLM agent deployed to increase Twitter engagement; the LLM may retrieve its previous tweets into the context window and make them more controversial, increasing engagement but also toxicity. We identify and study two processes that lead to ICRH: output-refinement and policy-refinement. For these processes, evaluations on static datasets are insufficient — they miss the feedback effects and thus cannot capture the most harmful behavior. In response, we provide three recommendations for evaluation to capture more instances of ICRH. As AI development accelerates, the effects of feedback loops will proliferate, increasing the need to understand their role in shaping LLM behavior.

arxiv情報

著者	Alexander Pan,Erik Jones,Meena Jagadeesan,Jacob Steinhardt
発行日	2024-02-09 18:59:29+00:00
arxivサイト	arxiv_id(pdf)

提供元, 利用サービス

arxiv.jp, Google

Feedback Loops With Language Models Drive In-Context Reward Hacking

要約

要約(オリジナル)

arxiv情報

提供元, 利用サービス

最近の投稿

最近のコメント

アーカイブ

カテゴリー