Direct Preference Optimization Using Sparse Feature-Level Constraints

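Summary

Aligning large language models (LLMs) with human preferences remains a key challenge.
Post-training techniques such as Reinforcement Learning from Human Feedback (RLHF) and Direct Preference Optimization (DPO) have achieved notable success, but they often introduce computational inefficiency and training instability.
This paper proposes Feature-level constrained Preference Optimization (FPO), a new method designed to simplify the alignment process while ensuring stability.
FPO leverages pre-trained Sparse Autoencoders (SAEs) and introduces feature-level constraints, enabling efficient, sparsity-enforced alignment.
The approach gains efficiency by using the sparse features activated in a well-trained sparse autoencoder, and retains the quality of sequential KL divergence by using a feature-level offline reference.
Experiments on benchmark datasets show that FPO achieves a 5.08% absolute improvement in win rate at much lower computational cost than state-of-the-art baselines, making it a promising solution for efficient and controllable LLM alignment.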

Summary (original)

The alignment of large language models (LLMs) with human preferences remains a key challenge. While post-training techniques like Reinforcement Learning from Human Feedback (RLHF) and Direct Preference Optimization (DPO) have achieved notable success, they often introduce computational inefficiencies and training instability. In this paper, we propose Feature-level constrained Preference Optimization (FPO), a novel method designed to simplify the alignment process while ensuring stability. FPO leverages pre-trained Sparse Autoencoders (SAEs) and introduces feature-level constraints, allowing for efficient, sparsity-enforced alignment. Our approach enjoys efficiency by using sparse features activated in a well-trained sparse autoencoder and the quality of sequential KL divergence by using the feature-level offline reference. Experimental results on benchmark datasets demonstrate that FPO achieves a 5.08% absolute improvement in win rate with much lower computational cost compared to state-of-the-art baselines, making it a promising solution for efficient and controllable LLM alignments.
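The abstract names two ingredients, a preference objective and a constraint computed on sparse SAE features against an offline reference, but gives no formulas. The sketch below is only a toy illustration of how such a combination could look, not the authors' method. Everything in it is an assumption: the function names, the reference-free margin used in place of the standard DPO log-ratio, the top-k selection of active features, the mean-squared penalty, and the hyperparameters `beta`, `lam`, and `k`.

```python
import math


def log_sigmoid(x):
    # Numerically stable log(sigmoid(x)).
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def top_k_indices(activations, k):
    # Indices of the k largest activations, treated here as the "active" sparse features.
    order = sorted(range(len(activations)), key=lambda i: activations[i], reverse=True)
    return order[:k]


def sparse_feature_penalty(policy_feats, ref_feats, k):
    # Hypothetical constraint: mean squared difference, restricted to features
    # that are among the top-k in either the policy or the precomputed reference.
    # Requires k >= 1 and non-empty feature vectors.
    idx = set(top_k_indices(policy_feats, k)) | set(top_k_indices(ref_feats, k))
    return sum((policy_feats[i] - ref_feats[i]) ** 2 for i in idx) / len(idx)


def toy_preference_loss(logp_chosen, logp_rejected, policy_feats, ref_feats,
                        beta=0.1, lam=0.5, k=2):
    # Assumed form: a reference-free preference margin plus a weighted
    # feature-level penalty. ref_feats stands in for activations that were
    # computed once offline, so no reference model is run during training.
    margin = beta * (logp_chosen - logp_rejected)
    return -log_sigmoid(margin) + lam * sparse_feature_penalty(policy_feats, ref_feats, k)


if __name__ == "__main__":
    policy = [0.0, 2.0, 0.0, 1.0]     # made-up SAE activations for the policy
    reference = [0.0, 1.0, 0.0, 1.0]  # made-up cached reference activations
    print(toy_preference_loss(-1.0, -3.0, policy, reference))
```

Restricting the penalty to a handful of active features is what would keep the constraint cheap in this sketch: the cost scales with `k` rather than with the vocabulary size, as a token-level KL term would.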

arXiv information

Authors	Qingyu Yin, Chak Tou Leong, Hongbo Zhang, Minjun Zhu, Hanqi Yan, Qiang Zhang, Yulan He, Wenjie Li, Jun Wang, Yue Zhang, Linyi Yang
Published	2024-11-12 07:54:13+00:00
arXiv site	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google
