Self-Supervised Alignment with Mutual Information: Learning to Follow Principles without Preference Labels

Abstract

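When prompting a language model (LM), users often expect the model to follow a set of behavioral principles across diverse tasks, such as producing insightful content while avoiding harmful or biased language.
Instilling such principles (i.e., a constitution) into a model is resource-intensive and technically challenging, and it generally requires human preference labels or examples.
SAMI is an iterative algorithm that finetunes a pretrained language model, without requiring preference labels or demonstrations, to increase the conditional mutual information between constitutions and self-generated responses, given queries from a dataset.
On single-turn dialogue and summarization, a SAMI-trained mistral-7b outperforms the initial pretrained model, with win rates between 66% and 77%.
Strikingly, it also surpasses an instruction-finetuned baseline (mistral-7b-instruct), with win rates between 55% and 57% on single-turn dialogue.
SAMI requires a model that writes the principles.
To avoid depending on strong models to write the principles, the authors align a strong pretrained model (mixtral-8x7b) using constitutions written by a weak instruction-finetuned model (mistral-7b-instruct), achieving a 65% win rate on summarization.
Finally, they investigate whether SAMI generalizes to diverse summarization principles (e.g., "summaries should be scientific") and scales to stronger models (llama3-70b), finding that it achieves win rates of up to 68% for learned principles and 67% for held-out principles compared to the base model.
The results show that a pretrained LM can learn to follow constitutions without using preference labels, demonstrations, or human oversight.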

Abstract (original)

When prompting a language model (LM), users often expect the model to adhere to a set of behavioral principles across diverse tasks, such as producing insightful content while avoiding harmful or biased language. Instilling such principles (i.e., a constitution) into a model is resource-intensive, technically challenging, and generally requires human preference labels or examples. We introduce SAMI, an iterative algorithm that finetunes a pretrained language model (without requiring preference labels or demonstrations) to increase the conditional mutual information between constitutions and self-generated responses given queries from a dataset. On single-turn dialogue and summarization, a SAMI-trained mistral-7b outperforms the initial pretrained model, with win rates between 66% and 77%. Strikingly, it also surpasses an instruction-finetuned baseline (mistral-7b-instruct) with win rates between 55% and 57% on single-turn dialogue. SAMI requires a model that writes the principles. To avoid dependence on strong models for writing principles, we align a strong pretrained model (mixtral-8x7b) using constitutions written by a weak instruction-finetuned model (mistral-7b-instruct), achieving a 65% win rate on summarization. Finally, we investigate whether SAMI generalizes to diverse summarization principles (e.g., ‘summaries should be scientific’) and scales to stronger models (llama3-70b), finding that it achieves win rates of up to 68% for learned and 67% for held-out principles compared to the base model. Our results show that a pretrained LM can learn to follow constitutions without using preference labels, demonstrations, or human oversight.
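The abstract says SAMI increases the conditional mutual information between constitutions and self-generated responses but does not spell out the estimator. As a rough illustration only, the sketch below computes an InfoNCE-style contrastive lower bound on that quantity from a matrix of log-likelihoods. The function name `infonce_mi_lower_bound`, the matrix layout, and the choice of InfoNCE itself are assumptions made for this example; they are not taken from the abstract and may differ from the authors' actual objective.

```python
import math


def logsumexp(values):
    """Numerically stable log(sum(exp(v) for v in values))."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def infonce_mi_lower_bound(logp):
    """InfoNCE-style lower bound on I(constitution; response | query).

    Assumed layout (hypothetical, for illustration):
      logp[i][j] = log p(response_j | constitution_i, query),
    where response_j was sampled under constitution_j, so the diagonal
    holds the matched pairs and the off-diagonal entries act as negatives.
    """
    n = len(logp)
    total = 0.0
    for j in range(n):
        # Compare the matched constitution against all constitutions
        # for the same response (normalizing over column j).
        column = [logp[i][j] for i in range(n)]
        total += logp[j][j] - logsumexp(column)
    # The bound is at most log(n) and is 0 when the constitutions
    # are indistinguishable from the model's point of view.
    return math.log(n) + total / n


if __name__ == "__main__":
    # Toy log-likelihoods for two constitutions and two responses.
    toy = [[-10.0, -14.0],
           [-15.0, -11.0]]
    print(infonce_mi_lower_bound(toy))
```

In a training loop one would maximize such a bound (equivalently, minimize its negative as a loss) with respect to the model that produces the log-likelihoods; how the paper batches queries and constitutions is not stated in the abstract.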

arXiv information

Authors	Jan-Philipp Fränken, Eric Zelikman, Rafael Rafailov, Kanishk Gandhi, Tobias Gerstenberg, Noah D. Goodman
Published	2024-05-21 17:31:27+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
