Diverging Preferences: When do Annotators Disagree and do Models Know?

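Summary

We examine diverging preferences in human-labeled preference datasets.
We develop a taxonomy of disagreement sources that spans 10 categories across four high-level classes: task underspecification, response style, refusals, and annotation errors.
We find that most disagreements conflict with standard reward modeling approaches, which are designed on the assumption that annotator disagreement is noise.
We then explore how these findings affect two areas of LLM development: reward modeling and evaluation.
Our experiments show that standard reward modeling methods, such as the Bradley-Terry model, cannot distinguish whether a given preference judgment reflects unanimous agreement among annotators or merely the majority opinion among diverging user preferences.
We also find that the same tendency appears in popular LLM-as-Judge evaluation methods, which consistently pick a winning response even when preferences diverge.
These findings highlight remaining challenges in LLM evaluation, which is strongly influenced by divisive features such as response style, and in developing pluralistically aligned LLMs.
To address these issues, we develop methods for identifying diverging preferences and mitigating their influence on evaluation and training.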

Abstract (original)

We examine diverging preferences in human-labeled preference datasets. We develop a taxonomy of disagreement sources spanning 10 categories across four high-level classes — task underspecification, response style, refusals, and annotation errors. We find that the majority of disagreements are in opposition with standard reward modeling approaches, which are designed with the assumption that annotator disagreement is noise. We then explore how these findings impact two areas of LLM development: reward modeling and evaluation. In our experiments, we demonstrate how standard reward modeling methods, like the Bradley-Terry model, fail to differentiate whether a given preference judgment is the result of unanimous agreement among annotators or the majority opinion among diverging user preferences. We also find that these tendencies are also echoed by popular LLM-as-Judge evaluation methods, which consistently identify a winning response in cases of diverging preferences. These findings highlight remaining challenges in LLM evaluations, which are greatly influenced by divisive features like response style, and in developing pluralistically aligned LLMs. To address these issues, we develop methods for identifying diverging preferences to mitigate their influence on evaluation and training.
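
One way to see why a Bradley-Terry objective cannot tell unanimous agreement from a narrow majority is sketched below. This is an illustration only, not the authors' experimental setup: it assumes the common practice of collapsing annotator votes into a single majority label before training, and the helper names (`bt_loss`, `majority_label`, `pair_loss`) and the reward values are hypothetical.

```python
import math


def bt_loss(r_chosen: float, r_rejected: float) -> float:
    """Bradley-Terry negative log-likelihood for one (chosen, rejected) pair.

    Equals -log(sigmoid(r_chosen - r_rejected)).
    """
    return math.log1p(math.exp(-(r_chosen - r_rejected)))


def majority_label(votes_a: int, votes_b: int) -> str:
    """Collapse annotator votes into one winner (assumed aggregation step)."""
    if votes_a == votes_b:
        raise ValueError("tie: no majority label")
    return "a" if votes_a > votes_b else "b"


def pair_loss(votes_a: int, votes_b: int, r_a: float, r_b: float) -> float:
    """Loss for one comparison after majority aggregation.

    The vote counts only decide which response is 'chosen'; the size of
    the margin never reaches the loss.
    """
    if majority_label(votes_a, votes_b) == "a":
        return bt_loss(r_a, r_b)
    return bt_loss(r_b, r_a)


# Hypothetical reward-model scores for responses A and B.
unanimous = pair_loss(4, 0, r_a=1.2, r_b=0.3)  # all four annotators prefer A
split = pair_loss(3, 1, r_a=1.2, r_b=0.3)      # one annotator prefers B

print(unanimous == split)  # prints True: the training signal is identical
```

Under this assumption, a 4-0 comparison and a 3-1 comparison contribute exactly the same loss, so the reward model is pushed toward the same confident margin in both cases.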

arXiv information

Authors	Michael JQ Zhang, Zhilin Wang, Jena D. Hwang, Yi Dong, Olivier Delalleau, Yejin Choi, Eunsol Choi, Xiang Ren, Valentina Pyatkin
Published	2024-10-18 17:32:22+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
