Human Feedback is not Gold Standard

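Summary

Human feedback has become the de facto standard for evaluating the performance of large language models, and it is increasingly used as a training objective.
However, it is unclear which properties of a generated output this single "preference" score captures.
We hypothesise that preference scores are subjective and prone to undesirable biases.
We critically analyse the use of human feedback in both training and evaluation, and test whether it fully captures a range of crucial error criteria.
We find that preference scores have fairly good coverage but under-represent important aspects such as factuality.
We further hypothesise that both preference scores and error annotations may be affected by confounders, and we use instruction-tuned models to generate outputs that vary along two possible confounding dimensions: assertiveness and complexity.
We find that the assertiveness of an output skews the perceived rate of factuality errors, which indicates that human annotations are not a fully reliable evaluation metric or training objective.
Finally, we offer preliminary evidence that using human feedback as a training objective disproportionately increases the assertiveness of model outputs.
We encourage future work to consider carefully whether preference scores are well aligned with the desired objective.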

Abstract (original)

Human feedback has become the de facto standard for evaluating the performance of Large Language Models, and is increasingly being used as a training objective. However, it is not clear which properties of a generated output this single 'preference' score captures. We hypothesise that preference scores are subjective and open to undesirable biases. We critically analyse the use of human feedback for both training and evaluation, to verify whether it fully captures a range of crucial error criteria. We find that while preference scores have fairly good coverage, they under-represent important aspects like factuality. We further hypothesise that both preference scores and error annotation may be affected by confounders, and leverage instruction-tuned models to generate outputs that vary along two possible confounding dimensions: assertiveness and complexity. We find that the assertiveness of an output skews the perceived rate of factuality errors, indicating that human annotations are not a fully reliable evaluation metric or training objective. Finally, we offer preliminary evidence that using human feedback as a training objective disproportionately increases the assertiveness of model outputs. We encourage future work to carefully consider whether preference scores are well aligned with the desired objective.

arXiv information

Authors	Tom Hosking, Phil Blunsom, Max Bartolo
Published	2023-09-28 11:18:20+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
