Safer-Instruct: Aligning Language Models with Automated Preference Data

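Summary

Reinforcement learning from human feedback (RLHF) is a key strategy for improving the capabilities of language models.
However, annotating preference data for RLHF is resource-intensive and demands creativity, while existing automatic generation methods are limited in data diversity and quality.
In response, we present Safer-Instruct, a new pipeline for automatically constructing large-scale preference data.
Our approach combines reversed instruction tuning, instruction induction, and expert model evaluation to generate high-quality preference data efficiently, without human annotators.
To verify the effectiveness of Safer-Instruct, we apply the pipeline to build a safety preference dataset as a case study.
An Alpaca model fine-tuned on this synthetic dataset not only shows improved harmlessness but also outperforms models fine-tuned on human-annotated safety preference data, while remaining competitive on downstream tasks.
Importantly, the Safer-Instruct framework is versatile: it can generate preference data across a range of domains, so its usefulness extends beyond safety preferences.
It thereby addresses the challenges of acquiring preference data and supports the development of more capable and responsible AI systems.
For the dataset and code implementation, see https://github.com/uscnlp-lime/safer-instruct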
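
The abstract names three stages (reversed instruction tuning, instruction induction, expert model evaluation) without giving implementation details. The following is only a minimal sketch of how such stages might fit together, not the authors' implementation. Everything in it is an assumption made for illustration: the names `PreferencePair`, `build_preference_data`, `induce_instruction`, `expert_respond`, and `expert_approves` are hypothetical, and so is the choice to treat the source text as the rejected response and the expert-approved response as the chosen one. The three callables stand in for real models.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass
class PreferencePair:
    """One preference record (hypothetical layout, not from the paper)."""
    instruction: str  # instruction induced from the source text
    chosen: str       # response the expert model approved
    rejected: str     # source text the instruction was induced from


def build_preference_data(
    texts: Iterable[str],
    induce_instruction: Callable[[str], str],
    expert_respond: Callable[[str], str],
    expert_approves: Callable[[str, str], bool],
) -> List[PreferencePair]:
    """Assemble preference pairs from raw texts.

    Assumed flow: a reverse-tuned model maps each text to an instruction
    that could have produced it, an expert model writes a candidate
    response, and the pair is kept only if the expert check passes.
    """
    pairs: List[PreferencePair] = []
    for text in texts:
        instruction = induce_instruction(text)
        candidate = expert_respond(instruction)
        if expert_approves(instruction, candidate):
            pairs.append(PreferencePair(instruction, candidate, text))
    return pairs


# Toy stand-ins for the three models, for demonstration only.
demo = build_preference_data(
    texts=["example source text"],
    induce_instruction=lambda t: "Write: " + t,
    expert_respond=lambda i: "A careful reply.",
    expert_approves=lambda i, r: len(r) > 0,
)
print(demo[0])
```

Passing the models in as plain callables keeps the sketch independent of any particular model API; in practice each one would wrap a call to a fine-tuned or hosted model.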

Abstract (original)

Reinforcement learning from human feedback (RLHF) is a vital strategy for enhancing model capability in language models. However, annotating preference data for RLHF is a resource-intensive and creativity-demanding process, while existing automatic generation methods face limitations in data diversity and quality. In response, we present Safer-Instruct, a novel pipeline for automatically constructing large-scale preference data. Our approach leverages reversed instruction tuning, instruction induction, and expert model evaluation to efficiently generate high-quality preference data without human annotators. To verify the effectiveness of Safer-Instruct, we apply the pipeline to construct a safety preference dataset as a case study. Finetuning an Alpaca model on this synthetic dataset not only demonstrates improved harmlessness but also outperforms models fine-tuned on human-annotated safety preference data, all the while maintaining a competitive edge in downstream tasks. Importantly, our Safer-Instruct framework is versatile and can be applied to generate preference data across various domains, extending its utility beyond safety preferences. It addresses the challenges in preference data acquisition and advances the development of more capable and responsible AI systems. For dataset and code implementation, see https://github.com/uscnlp-lime/safer-instruct

arXiv information

Authors	Taiwei Shi, Kai Chen, Jieyu Zhao
Published	2024-03-31 22:42:03+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
