Ask Again, Then Fail: Large Language Models’ Vacillations in Judgement

Abstract

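Now that generative conversational large language models (LLMs) such as ChatGPT serve as virtual assistants in many fields, the stability and reliability of their responses have become crucial.
In practice, however, these models have been observed to waver in their judgements when users ask follow-up questions that express skepticism or disagreement.
Drawing inspiration from questioning strategies in education, this work proposes a \textsc{Follow-up Questioning Mechanism} together with two evaluation metrics for assessing the judgement consistency of LLMs before and after exposure to disturbances.
Under this mechanism, the judgement consistency of ChatGPT, PaLM2-Bison, and Vicuna-13B is evaluated across eight reasoning benchmarks.
The empirical results show that, even when the initial answer is correct, judgement consistency drops sharply once an LLM faces disturbances such as questioning, negation, or misleading follow-ups.
To validate the issue further, the authors study these models' judgement consistency under various settings (sampling temperature and prompts), observe the impact of prompt tone, and conduct an in-depth error analysis to gain deeper insight into model behavior.
They also explore several prompting methods for mitigating the issue and demonstrate their effectiveness\footnote{\url{https://github.com/NUSTM/LLMs-Waver-In-Judgements}}.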
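
As a rough illustration of what measuring judgement consistency before and after a disturbance could look like, here is a minimal Python sketch. The abstract does not define the two evaluation metrics or the wording of the follow-ups, so everything below is an assumption made for illustration: the `FOLLOW_UPS` prompts, the `judgement_consistency` function, the exact-match scoring, the two reported quantities (absolute and relative accuracy drop), and the `toy_model` stub that stands in for a real LLM.

```python
from typing import Callable, Dict, List, Tuple

Message = Tuple[str, str]                 # (role, text)
ChatFn = Callable[[List[Message]], str]   # hypothetical chat interface

# Illustrative follow-up prompts, one per disturbance type named in the
# abstract. The wording is an assumption, not taken from the paper.
FOLLOW_UPS: Dict[str, str] = {
    "questioning": "Are you sure?",
    "negation": "I don't think that is right.",
    "misleading": "I think the answer should be {wrong}. What do you think?",
}


def judgement_consistency(
    chat: ChatFn,
    items: List[Tuple[str, str, str]],    # (question, gold answer, wrong answer)
    kind: str,
) -> Dict[str, float]:
    """Compare accuracy before and after one follow-up disturbance."""
    if not items:
        raise ValueError("items must not be empty")
    template = FOLLOW_UPS[kind]
    correct_before = 0
    correct_after = 0
    for question, gold, wrong in items:
        history: List[Message] = [("user", question)]
        first = chat(history)
        history.append(("assistant", first))
        history.append(("user", template.format(wrong=wrong)))
        second = chat(history)
        # Exact match is a simplifying assumption for scoring.
        correct_before += first.strip() == gold
        correct_after += second.strip() == gold
    acc_before = correct_before / len(items)
    acc_after = correct_after / len(items)
    drop = acc_before - acc_after
    relative = drop / acc_before if acc_before > 0 else 0.0
    return {
        "accuracy_before": acc_before,
        "accuracy_after": acc_after,
        "drop": drop,
        "relative_drop": relative,
    }


def toy_model(history: List[Message]) -> str:
    """Stub model: answers correctly at first, then caves on one question."""
    answers = {"2+2?": "4", "3+3?": "6"}
    first_question = history[0][1]
    if len(history) == 1:
        return answers[first_question]
    return "5" if first_question == "2+2?" else answers[first_question]


if __name__ == "__main__":
    data = [("2+2?", "4", "5"), ("3+3?", "6", "7")]
    print(judgement_consistency(toy_model, data, "negation"))
```

Replacing `toy_model` with a wrapper around a real chat API would let the same loop run on actual benchmark items; answer extraction would then need to be more robust than exact string matching.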

Abstract (original)

With the emergence of generative conversational large language models (LLMs) like ChatGPT, serving as virtual assistants in various fields, the stability and reliability of their responses have become crucial. However, during usage, it has been observed that these models tend to waver in their judgements when confronted with follow-up questions from users expressing skepticism or disagreement. In this work, we draw inspiration from questioning strategies in education and propose a \textsc{Follow-up Questioning Mechanism} along with two evaluation metrics to assess the judgement consistency of LLMs before and after exposure to disturbances. We evaluate the judgement consistency of ChatGPT, PaLM2-Bison, and Vicuna-13B under this mechanism across eight reasoning benchmarks. Empirical results show that even when the initial answers are correct, judgement consistency sharply decreases when LLMs face disturbances such as questioning, negation, or misleading. Additionally, we study these models’ judgement consistency under various settings (sampling temperature and prompts) to validate this issue further, observing the impact of prompt tone and conducting an in-depth error analysis for deeper behavioral insights. Furthermore, we also explore several prompting methods to mitigate this issue and demonstrate their effectiveness\footnote{\url{https://github.com/NUSTM/LLMs-Waver-In-Judgements}}.

arXiv information

Authors	Qiming Xie, Zengzhi Wang, Yi Feng, Rui Xia
Published	2024-02-27 14:17:53+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
