Self-play with Execution Feedback: Improving Instruction-following Capabilities of Large Language Models

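Summary

One of the core capabilities of large language models (LLMs) is following natural language instructions.
However, how to automatically construct high-quality training data that strengthens LLMs' ability to follow complex instructions, without manual annotation, remains an open problem.
This paper introduces AutoIF, the first scalable and reliable method for automatically generating instruction-following training data.
AutoIF turns the validation of instruction-following data quality into code verification: it asks LLMs to generate instructions, the corresponding code that checks whether a response to an instruction is correct, and unit test samples that verify the code's correctness.
Execution feedback-based rejection sampling then generates data for Supervised Fine-Tuning (SFT) and Reinforcement Learning from Human Feedback (RLHF) training.
Applied to the top open-source LLMs, Qwen2 and LLaMA3, in self-alignment and strong-to-weak distillation settings, AutoIF achieves significant improvements across three training algorithms: SFT, Offline DPO, and Online DPO.
The code is publicly available at https://github.com/QwenLM/AutoIF.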
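
The pipeline described above (instruction, verification code, unit test samples, then rejection sampling on execution feedback) can be illustrated with a minimal Python sketch. This is not the code from the AutoIF repository: the example instruction, the `evaluate` function name, the test cases, and all helper names are hypothetical, and in a real setting each of these artifacts would be generated by an LLM rather than written by hand.

```python
from typing import Callable, List, Tuple

Verifier = Callable[[str], bool]


def compile_verifier(source: str) -> Verifier:
    """Turn verifier source code into a callable.

    Assumption: the source defines a function named `evaluate`.
    Generated code is untrusted; a real system should run it in a sandbox.
    """
    namespace: dict = {}
    exec(source, namespace)
    return namespace["evaluate"]


def verifier_passes_tests(verifier: Verifier, cases: List[Tuple[str, bool]]) -> bool:
    """Check the verifier against unit test samples (response, expected verdict)."""
    for response, expected in cases:
        try:
            if bool(verifier(response)) != expected:
                return False
        except Exception:
            # A verifier that crashes on a test sample is treated as incorrect.
            return False
    return True


def rejection_sample(verifier: Verifier, candidates: List[str]) -> Tuple[List[str], List[str]]:
    """Split candidate responses into accepted and rejected by executing the verifier."""
    accepted, rejected = [], []
    for response in candidates:
        try:
            ok = bool(verifier(response))
        except Exception:
            ok = False
        (accepted if ok else rejected).append(response)
    return accepted, rejected


# Hypothetical artifacts; in the described method an LLM would produce these.
INSTRUCTION = "Answer in all lowercase letters."
VERIFIER_SOURCE = """
def evaluate(response):
    return response == response.lower()
"""
TEST_CASES = [("hello world", True), ("Hello World", False)]

verifier = compile_verifier(VERIFIER_SOURCE)
verifier_is_valid = verifier_passes_tests(verifier, TEST_CASES)

candidates = ["the capital is paris.", "The capital is Paris."]
accepted, rejected = rejection_sample(verifier, candidates)
```

In this sketch, accepted responses could serve as SFT examples, and accepted/rejected responses to the same instruction could plausibly be paired as preference data for DPO-style training. The abstract does not spell out how the pairs are built, so that pairing is an assumption.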

Summary (original)

One core capability of large language models (LLMs) is to follow natural language instructions. However, the issue of automatically constructing high-quality training data to enhance the complex instruction-following abilities of LLMs without manual annotation remains unresolved. In this paper, we introduce AutoIF, the first scalable and reliable method for automatically generating instruction-following training data. AutoIF transforms the validation of instruction-following data quality into code verification, requiring LLMs to generate instructions, the corresponding code to check the correctness of the instruction responses, and unit test samples to verify the code’s correctness. Then, execution feedback-based rejection sampling can generate data for Supervised Fine-Tuning (SFT) and Reinforcement Learning from Human Feedback (RLHF) training. AutoIF achieves significant improvements across three training algorithms, SFT, Offline DPO, and Online DPO, when applied to the top open-source LLMs, Qwen2 and LLaMA3, in self-alignment and strong-to-weak distillation settings. Our code is publicly available at https://github.com/QwenLM/AutoIF.

arXiv information

Authors	Guanting Dong, Keming Lu, Chengpeng Li, Tingyu Xia, Bowen Yu, Chang Zhou, Jingren Zhou
Published	2024-07-17 14:33:35+00:00
arXiv page	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
