Rejection Improves Reliability: Training LLMs to Refuse Unknown Questions Using RL from Knowledge Feedback

Summary

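Large language models (LLMs) often produce erroneous outputs, known as hallucinations, because they have limited ability to recognize questions that lie beyond their knowledge scope.
While addressing hallucination has become a research focus, prior efforts have concentrated mainly on improving correctness without giving sufficient consideration to the importance of rejection mechanisms.
In this paper, we present a comprehensive study of the role of rejection, introducing the concept of model reliability along with corresponding metrics.
These metrics measure a model's ability to provide accurate responses while appropriately refusing questions beyond its knowledge limits, thereby minimizing hallucinations.
To improve the inherent reliability of LLMs, we introduce a new alignment framework called Reinforcement Learning from Knowledge Feedback (RLKF).
RLKF uses knowledge feedback to dynamically determine the model's knowledge boundary and trains a reliable reward model to encourage refusal of out-of-knowledge questions.
Experimental results on mathematical problems confirm that RLKF substantially improves LLM reliability.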
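The summary does not spell out how the reliability metrics are computed. As a minimal illustrative sketch, assuming each response is graded as correct, wrong (a hallucination), or refused, the rate-style quantities such metrics balance might be tallied as below. The names `reliability_report`, `precision`, and `truthfulness` are hypothetical and are not the paper's definitions.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Iterable

# Hypothetical grading labels for a single response (not taken from the paper).
CORRECT, WRONG, REFUSED = "correct", "wrong", "refused"


@dataclass(frozen=True)
class ReliabilityReport:
    accuracy: float      # correct / all questions
    error_rate: float    # wrong / all questions (hallucinated answers)
    refusal_rate: float  # refused / all questions
    precision: float     # correct / answered questions (correct + wrong)
    truthfulness: float  # (correct + refused) / all questions, i.e. non-hallucinated share


def reliability_report(outcomes: Iterable[str]) -> ReliabilityReport:
    """Summarize per-question grading labels into simple rate metrics."""
    counts = Counter(outcomes)
    unexpected = set(counts) - {CORRECT, WRONG, REFUSED}
    if unexpected:
        raise ValueError(f"unexpected labels: {sorted(unexpected)}")
    n = sum(counts.values())
    if n == 0:
        raise ValueError("need at least one graded response")
    c, w, r = counts[CORRECT], counts[WRONG], counts[REFUSED]
    answered = c + w
    return ReliabilityReport(
        accuracy=c / n,
        error_rate=w / n,
        refusal_rate=r / n,
        precision=c / answered if answered else 0.0,
        truthfulness=(c + r) / n,
    )


if __name__ == "__main__":
    graded = [CORRECT] * 6 + [WRONG] * 2 + [REFUSED] * 2
    print(reliability_report(graded))
```

With six correct, two wrong, and two refused answers, accuracy is 0.6 and precision over answered questions is 0.75. Had the two refused questions been answered wrongly instead, precision would drop to 0.6 and the error rate would double from 0.2 to 0.4, which is the kind of trade-off between answering and refusing that the summary describes.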

Summary (original)

Large Language Models (LLMs) often generate erroneous outputs, known as hallucinations, due to their limitations in discerning questions beyond their knowledge scope. While addressing hallucination has been a focal point in research, previous efforts primarily concentrate on enhancing correctness without giving due consideration to the significance of rejection mechanisms. In this paper, we conduct a comprehensive examination of the role of rejection, introducing the notion of model reliability along with corresponding metrics. These metrics measure the model’s ability to provide accurate responses while adeptly rejecting questions exceeding its knowledge boundaries, thereby minimizing hallucinations. To improve the inherent reliability of LLMs, we present a novel alignment framework called Reinforcement Learning from Knowledge Feedback (RLKF). RLKF leverages knowledge feedback to dynamically determine the model’s knowledge boundary and trains a reliable reward model to encourage the refusal of out-of-knowledge questions. Experimental results on mathematical questions affirm the substantial efficacy of RLKF in significantly enhancing LLM reliability.
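The abstract says RLKF uses knowledge feedback to locate the model's knowledge boundary and trains a reward model that encourages refusal outside it. The sketch below shows one way such feedback could be turned into reward-model preference pairs. It assumes, hypothetically, that a question counts as "known" when at least a `threshold` fraction of `k` sampled answers exactly match a reference answer; `build_preference_pairs`, `sample_fn`, `REFUSAL`, and this threshold rule are illustrative assumptions, not the paper's procedure.

```python
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical refusal text; the abstract does not give the paper's actual wording.
REFUSAL = "Sorry, I don't know the answer to this question."


@dataclass(frozen=True)
class PreferencePair:
    prompt: str
    chosen: str    # response the reward model should score higher
    rejected: str  # response the reward model should score lower


def build_preference_pairs(
    question: str,
    reference: str,
    sample_fn: Callable[[str, int], List[str]],
    k: int = 8,
    threshold: float = 0.5,
) -> List[PreferencePair]:
    """Probe the model with k samples, decide known/unknown, and emit ranking pairs."""
    if not 0.0 < threshold <= 1.0:
        raise ValueError("threshold must be in (0, 1]")
    samples = sample_fn(question, k)
    if not samples:
        return []
    target = reference.strip()
    correct = [s for s in samples if s.strip() == target]
    wrong = [s for s in samples if s.strip() != target]
    known = len(correct) / len(samples) >= threshold

    pairs: List[PreferencePair] = []
    if known:
        # Inside the boundary: a correct answer should beat a wrong answer and a refusal.
        if wrong:
            pairs.append(PreferencePair(question, correct[0], wrong[0]))
        pairs.append(PreferencePair(question, correct[0], REFUSAL))
    else:
        # Outside the boundary: refusing should beat a hallucinated answer.
        # (threshold <= 1 guarantees at least one wrong sample in this branch.)
        pairs.append(PreferencePair(question, REFUSAL, wrong[0]))
    return pairs


if __name__ == "__main__":
    # Stand-in for sampling from the policy model: canned answers per question.
    canned = {
        "What is 6 * 2?": ["12"] * 6 + ["13"] * 2,
        "What is 17 * 23?": ["391"] * 2 + ["401"] * 6,
    }

    def sampler(question: str, k: int) -> List[str]:
        return canned[question][:k]

    for q, ref in [("What is 6 * 2?", "12"), ("What is 17 * 23?", "391")]:
        for pair in build_preference_pairs(q, ref, sampler):
            print(pair)
```

Exact string matching against a reference is only reasonable for short final answers such as the numeric results of math problems; free-form answers would need a different correctness check.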

arXiv information

Authors	Hongshen Xu, Zichen Zhu, Situo Zhang, Da Ma, Shuai Fan, Lu Chen, Kai Yu
Published	2024-08-08 08:57:23+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
