Understanding the Dark Side of LLMs’ Intrinsic Self-Correction

要約

固有の自己修正は、LLM の固有の能力のみに基づいたフィードバックプロンプトを介して LLM の応答を改善するために提案されました。
しかし、最近の研究では、フィードバックプロンプトとしてオラクルラベルがないと、LLM の本質的な自己修正が失敗することが示されています。
この論文では、さまざまなタスク、特に失敗のケースに対する LLM の本質的な自己修正を解釈することを目的としています。
ChatGPT ファミリ (o1、4o、3.5-turbo) や Llama ファミリ (2-7B、3-8B、3.1-8B) などの最先端 (SOTA) LLM を使用して、1 つの単純なタスクと 3 つの複雑なタスクを含めることによって
では、LLM の本質的な自己修正の暗い側面を明らかにするために 3 つの解釈方法を設計します。
私たちは、本質的な自己修正によって、(1) LLM が中間的な回答と最終的な回答の両方を揺るがせ、単純な事実に関する質問に対して即座に偏見をもたらす可能性があることを確認しました。
(2) 複雑なタスクに対して人間のような認知バイアスを導入します。
私たちの調査結果を踏まえて、私たちは緩和のための 2 つのシンプルかつ効果的な戦略も提供します。それは、質問を繰り返すことと、いくつかのサンプルを使用した教師付き微調整です。
私たちの仕事は https://x-isc.info/ でオープンソース化されています。

要約(オリジナル)

Intrinsic self-correction was proposed to improve LLMs’ responses via feedback prompts solely based on their inherent capability. However, recent works show that LLMs’ intrinsic self-correction fails without oracle labels as feedback prompts. In this paper, we aim to interpret LLMs’ intrinsic self-correction for different tasks, especially for those failure cases. By including one simple task and three complex tasks with state-of-the-art (SOTA) LLMs like ChatGPT families (o1, 4o, 3.5-turbo) and Llama families (2-7B, 3-8B, and 3.1-8B), we design three interpretation methods to reveal the dark side of LLMs’ intrinsic self-correction. We identify intrinsic self-correction can (1) cause LLMs to waver both intermedia and final answers and lead to prompt bias on simple factual questions; (2) introduce human-like cognitive bias on complex tasks. In light of our findings, we also provide two simple yet effective strategies for alleviation: question repeating and supervised fine-tuning with a few samples. We open-source our work at https://x-isc.info/.

arxiv情報

著者	Qingjie Zhang,Han Qiu,Di Wang,Haoting Qian,Yiming Li,Tianwei Zhang,Minlie Huang
発行日	2024-12-19 15:39:31+00:00
arxivサイト	arxiv_id(pdf)

提供元, 利用サービス

arxiv.jp, Google

Understanding the Dark Side of LLMs’ Intrinsic Self-Correction

要約

要約(オリジナル)

arxiv情報

提供元, 利用サービス

最近の投稿

最近のコメント

アーカイブ

カテゴリー