More RLHF, More Trust? On The Impact of Human Preference Alignment On Language Model Trustworthiness

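Abstract

The rapid growth in Large Language Model (LLM) development has improved performance on cognitive tasks and created an urgent need to align these models with human values so that their capabilities can be used safely.
Although preference learning algorithms such as Reinforcement Learning From Human Feedback (RLHF) are effective at aligning models with human preferences, their assumed improvements to model trustworthiness have not been thoroughly verified.
To this end, this study investigates how models aligned with general-purpose preference data on helpfulness and harmlessness perform across five trustworthiness verticals: toxicity, stereotypical bias, machine ethics, truthfulness, and privacy.
For model alignment, we focus on three widely used RLHF variants: Supervised Finetuning (SFT), Proximal Policy Optimization (PPO), and Direct Preference Optimization (DPO).
Through extensive empirical investigation, we find that RLHF is far from guaranteed to improve trustworthiness, and that there is a complex interplay between preference data, alignment algorithms, and specific trustworthiness aspects.
Together, these results underscore the need for more nuanced approaches to model alignment.
By shedding light on the intricate dynamics of these components within model alignment, we hope this research will guide the community toward developing language models that are both capable and trustworthy.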

Abstract (original)

The surge in Large Language Models (LLMs) development has led to improved performance on cognitive tasks as well as an urgent need to align these models with human values in order to safely exploit their power. Despite the effectiveness of preference learning algorithms like Reinforcement Learning From Human Feedback (RLHF) in aligning human preferences, their assumed improvements on model trustworthiness haven’t been thoroughly testified. Toward this end, this study investigates how models that have been aligned with general-purpose preference data on helpfulness and harmlessness perform across five trustworthiness verticals: toxicity, stereotypical bias, machine ethics, truthfulness, and privacy. For model alignment, we focus on three widely used RLHF variants: Supervised Finetuning (SFT), Proximal Policy Optimization (PPO), and Direct Preference Optimization (DPO). Through extensive empirical investigations, we discover that the improvement in trustworthiness by RLHF is far from guaranteed, and there exists a complex interplay between preference data, alignment algorithms, and specific trustworthiness aspects. Together, our results underscore the need for more nuanced approaches for model alignment. By shedding light on the intricate dynamics of these components within model alignment, we hope this research will guide the community towards developing language models that are both capable and trustworthy.

arXiv information

Authors	Aaron J. Li, Satyapriya Krishna, Himabindu Lakkaraju
Published	2024-04-29 17:00:53+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
