Ask Again, Then Fail: Large Language Models’ Vacillations in Judgement

Summary

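Generative conversational large language models (LLMs) such as ChatGPT now act as virtual assistants in many fields, so the stability and reliability of their responses matter. In practice, however, these models tend to waver in their judgement when a user follows up with a question that expresses skepticism or disagreement. Taking inspiration from questioning strategies used in education, this work proposes a Follow-up Questioning Mechanism together with two evaluation metrics for assessing the judgement consistency of LLMs before and after a disturbance. Under this mechanism, the judgement consistency of ChatGPT, PaLM2-Bison, and Vicuna-13B is evaluated on eight reasoning benchmarks. The results show that, even when the initial answer is correct, judgement consistency drops sharply once the model faces disturbances such as questioning, negation, or misleading. To validate the issue further, the authors study judgement consistency under different settings (sampling temperature and prompts), observe the effect of prompt tone, and carry out a detailed error analysis for deeper behavioral insight. They also explore several prompting methods for mitigating the problem and demonstrate their effectiveness (code: https://github.com/NUSTM/LLMs-Waver-In-Judgements).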

Summary (original)

With the emergence of generative conversational large language models (LLMs) like ChatGPT, serving as virtual assistants in various fields, the stability and reliability of their responses have become crucial. However, during usage, it has been observed that these models tend to waver in their judgements when confronted with follow-up questions from users expressing skepticism or disagreement. In this work, we draw inspiration from questioning strategies in education and propose a Follow-up Questioning Mechanism along with two evaluation metrics to assess the judgement consistency of LLMs before and after exposure to disturbances. We evaluate the judgement consistency of ChatGPT, PaLM2-Bison, and Vicuna-13B under this mechanism across eight reasoning benchmarks. Empirical results show that even when the initial answers are correct, judgement consistency sharply decreases when LLMs face disturbances such as questioning, negation, or misleading. Additionally, we study these models’ judgement consistency under various settings (sampling temperature and prompts) to validate this issue further, observing the impact of prompt tone and conducting an in-depth error analysis for deeper behavioral insights. Furthermore, we also explore several prompting methods to mitigate this issue and demonstrate their effectiveness (https://github.com/NUSTM/LLMs-Waver-In-Judgements).
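The abstract does not define its two metrics, so the following is only a minimal sketch of how consistency before and after a follow-up disturbance could be quantified. The names `Trial`, `accuracy`, and `consistency_report` are hypothetical, and the two quantities computed here (absolute and relative accuracy drop) are illustrative assumptions, not necessarily the metrics used in the paper.

```python
from dataclasses import dataclass


@dataclass
class Trial:
    """One benchmark question (hypothetical record layout)."""
    gold: str            # reference answer
    initial: str         # model's first answer
    after_followup: str  # model's answer after a skeptical follow-up


def accuracy(trials, field):
    """Fraction of trials whose `field` answer matches the gold answer."""
    if not trials:
        raise ValueError("need at least one trial")
    return sum(getattr(t, field) == t.gold for t in trials) / len(trials)


def consistency_report(trials):
    """Compare accuracy before and after the follow-up disturbance.

    The absolute and relative drops are illustrative measures only.
    """
    acc_before = accuracy(trials, "initial")
    acc_after = accuracy(trials, "after_followup")
    drop = acc_before - acc_after
    # Relative drop: share of the initial accuracy lost after the follow-up.
    relative_drop = drop / acc_before if acc_before > 0 else 0.0
    return {
        "acc_before": acc_before,
        "acc_after": acc_after,
        "drop": drop,
        "relative_drop": relative_drop,
    }
```

A model that keeps every initially correct answer after being challenged would show a drop of zero. The relative drop makes models with different starting accuracies easier to compare.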

arXiv information

Authors	Qiming Xie, Zengzhi Wang, Yi Feng, Rui Xia
Published	2023-10-03 16:08:41+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, DeepL
