A Critical Evaluation of AI Feedback for Aligning Large Language Models

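Summary

Reinforcement learning with AI feedback (RLAIF) is a common paradigm for improving the instruction-following ability of strong pre-trained language models.
RLAIF first runs supervised fine-tuning (SFT) on demonstrations from a teacher model, then fine-tunes the model further with reinforcement learning (RL) using feedback from a critic model.
Recent popular open-source models have shown large performance gains from the RL step, but this paper questions whether the complexity of that step is really justified when the feedback comes from an AI.
The authors show that the gains from the RL step are almost entirely due to the widespread practice of collecting SFT data with a teacher model (e.g., GPT-3.5) that is weaker than the critic model (e.g., GPT-4) used to generate the AI feedback.
Specifically, simple supervised fine-tuning with GPT-4 as the teacher outperforms existing RLAIF pipelines.
More generally, the gains from RLAIF vary substantially across base model families, test-time evaluation protocols, and critic models.
Finally, the paper offers a mechanistic explanation of when SFT can outperform the full two-step RLAIF pipeline, along with suggestions for getting the most out of RLAIF in practice.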

Abstract (original)

Reinforcement learning with AI feedback (RLAIF) is a popular paradigm for improving the instruction-following abilities of powerful pre-trained language models. RLAIF first performs supervised fine-tuning (SFT) using demonstrations from a teacher model and then further fine-tunes the model with reinforcement learning (RL), using feedback from a critic model. While recent popular open-source models have demonstrated substantial improvements in performance from the RL step, in this paper we question whether the complexity of this RL step is truly warranted for AI feedback. We show that the improvements of the RL step are virtually entirely due to the widespread practice of using a weaker teacher model (e.g. GPT-3.5) for SFT data collection than the critic (e.g., GPT-4) used for AI feedback generation. Specifically, we show that simple supervised fine-tuning with GPT-4 as the teacher outperforms existing RLAIF pipelines. More generally, we find that the gains from RLAIF vary substantially across base model families, test-time evaluation protocols, and critic models. Finally, we provide a mechanistic explanation for when SFT may outperform the full two-step RLAIF pipeline as well as suggestions for making RLAIF maximally useful in practice.
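
The two setups contrasted in the abstract can be written down as a small structural sketch. This is only an illustration of which model plays which role, not the authors' code: `PipelineConfig`, `stages`, and `teacher_differs_from_critic` are hypothetical names, and no training is performed. Checking whether the teacher and critic differ is a rough stand-in for the "weaker teacher, stronger critic" situation the abstract describes.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class PipelineConfig:
    """Hypothetical description of one training setup (illustrative only)."""
    name: str
    sft_teacher: str               # model that writes the SFT demonstrations
    critic: Optional[str] = None   # model that gives AI feedback; None means no RL step


def stages(config: PipelineConfig) -> List[str]:
    """List the training stages implied by a configuration."""
    steps = [f"SFT on demonstrations from {config.sft_teacher}"]
    if config.critic is not None:
        steps.append(f"RL using feedback from {config.critic}")
    return steps


def teacher_differs_from_critic(config: PipelineConfig) -> bool:
    """True when the SFT data and the AI feedback come from different models."""
    return config.critic is not None and config.critic != config.sft_teacher


# Common RLAIF setup named in the abstract: GPT-3.5 as teacher, GPT-4 as critic.
rlaif = PipelineConfig("rlaif", sft_teacher="GPT-3.5", critic="GPT-4")
# SFT-only baseline named in the abstract: GPT-4 as the teacher, no RL step.
sft_only = PipelineConfig("sft_only", sft_teacher="GPT-4")

for cfg in (rlaif, sft_only):
    print(cfg.name, stages(cfg), teacher_differs_from_critic(cfg))
```

Laying the configurations side by side makes the confound visible: in the first setup the RL step is also the only place where the stronger model's signal enters training, so a gain from RL cannot be separated from a gain from simply using GPT-4.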

arXiv information

Authors	Archit Sharma, Sedrick Keh, Eric Mitchell, Chelsea Finn, Kushal Arora, Thomas Kollar
Published	2024-02-19 18:53:54+00:00
arXiv page	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
