SafeDPO: A Simple Approach to Direct Preference Optimization with Enhanced Safety

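Summary

As large language models (LLMs) continue to advance and find applications in a growing number of fields, ensuring their safety has become increasingly important.
To address safety concerns, recent studies have proposed integrating safety constraints into Reinforcement Learning from Human Feedback (RLHF).
However, these approaches tend to be complex, because they combine the already complicated RLHF procedure with the additional steps that the safety constraints require.
Inspired by Direct Preference Optimization (DPO), we introduce a new algorithm called SafeDPO, which is designed to optimize the safety alignment objective directly in a single stage of policy learning, without requiring relaxation.
SafeDPO introduces only one additional hyperparameter to further enhance safety and requires only minor modifications to standard DPO.
As a result, it enhances the safety of LLMs while eliminating the need to fit separate reward and cost models or to sample from the language model during fine-tuning.
Finally, we demonstrate that SafeDPO achieves competitive performance compared to state-of-the-art safety alignment algorithms, both in aligning with human preferences and in improving safety.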

Summary (original)

As Large Language Models (LLMs) continue to advance and find applications across a growing number of fields, ensuring the safety of LLMs has become increasingly critical. To address safety concerns, recent studies have proposed integrating safety constraints into Reinforcement Learning from Human Feedback (RLHF). However, these approaches tend to be complex, as they encompass complicated procedures in RLHF along with additional steps required by the safety constraints. Inspired by Direct Preference Optimization (DPO), we introduce a new algorithm called SafeDPO, which is designed to directly optimize the safety alignment objective in a single stage of policy learning, without requiring relaxation. SafeDPO introduces only one additional hyperparameter to further enhance safety and requires only minor modifications to standard DPO. As a result, it eliminates the need to fit separate reward and cost models or to sample from the language model during fine-tuning, while still enhancing the safety of LLMs. Finally, we demonstrate that SafeDPO achieves competitive performance compared to state-of-the-art safety alignment algorithms, both in terms of aligning with human preferences and improving safety.
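For readers unfamiliar with the baseline, the following is a minimal sketch of the standard DPO loss for a single preference pair, which is the objective the abstract says SafeDPO modifies. It is not the SafeDPO objective: the abstract does not spell out the modification or the additional hyperparameter, so neither is shown here. The function names, argument names, and the default `beta` value are illustrative assumptions.

```python
import math


def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def dpo_loss(policy_logp_chosen, policy_logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss for one (chosen, rejected) response pair.

    Inputs are sequence log-probabilities under the policy being trained
    and under the frozen reference model. beta=0.1 is an assumed default.
    """
    # Implicit reward of each response: beta * log(pi / pi_ref).
    chosen_reward = beta * (policy_logp_chosen - ref_logp_chosen)
    rejected_reward = beta * (policy_logp_rejected - ref_logp_rejected)
    margin = chosen_reward - rejected_reward
    # -log(sigmoid(margin)) equals softplus(-margin).
    return softplus(-margin)
```

When the policy and the reference assign identical log-probabilities, the margin is zero and the loss is log 2; the loss falls as the policy raises the chosen response relative to the rejected one. No reward model and no sampling from the policy is needed to evaluate it, which matches the property the abstract highlights.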

arXiv information

Authors	Geon-Hyeong Kim, Youngsoo Jang, Yu Jin Kim, Byoungjip Kim, Honglak Lee, Kyunghoon Bae, Moontae Lee
Published	2025-05-26 14:50:01+00:00
arXiv page	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google
