Bi-Factorial Preference Optimization: Balancing Safety-Helpfulness in Language Models

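Summary

Fine-tuning large language models (LLMs) on human preferences, typically through reinforcement learning from human feedback (RLHF), has proven successful at enhancing their capabilities.
However, ensuring the safety of LLMs during fine-tuning remains a critical concern, and mitigating potential conflicts between safety and helpfulness is costly in RLHF.
To address this issue, we propose a supervised learning framework called Bi-Factorial Preference Optimization (BFPO), which re-parameterizes a joint RLHF objective over both safety and helpfulness into a single supervised learning objective.
In the supervised optimization, a labeling function captures a global preference ranking that balances safety and helpfulness.
To evaluate BFPO, we develop a benchmark comprising comprehensive discriminative and generative tasks for helpfulness and harmlessness.
The results indicate that our method significantly outperforms existing approaches in both safety and helpfulness.
Moreover, BFPO achieves the same level of safety as methods that rely heavily on human labor, using less than 10% of the computational resources and of the human prompting and annotation effort.
The training recipes are available at https://github.com/wx-zhang/bfpo.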

Abstract (original)

Fine-tuning large language models (LLMs) on human preferences, typically through reinforcement learning from human feedback (RLHF), has proven successful in enhancing their capabilities. However, ensuring the safety of LLMs during fine-tuning remains a critical concern, and mitigating the potential conflicts in safety and helpfulness is costly in RLHF. To address this issue, we propose a supervised learning framework called Bi-Factorial Preference Optimization (BFPO), which re-parameterizes a joint RLHF objective of both safety and helpfulness into a single supervised learning objective. In supervised optimization, a labeling function is used to capture the global preferences ranking to balance both safety and helpfulness. To evaluate BFPO, we develop a benchmark that includes comprehensive discriminative and generative tasks for helpfulness and harmlessness. The results indicate that our method significantly outperforms existing approaches in both safety and helpfulness. Moreover, BFPO achieves the same level of safety as methods that heavily rely on human labor with less than 10% of the computational resources and human prompting and annotation process. The training recipes can be found here: https://github.com/wx-zhang/bfpo.

arXiv information

Authors	Wenxuan Zhang, Philip H. S. Torr, Mohamed Elhoseiny, Adel Bibi
Published	2025-04-08 11:04:33+00:00
arXiv page	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
