Chain-of-Thought Reasoning is a Policy Improvement Operator

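Summary

Large language models have astonished the world with compelling new capabilities.
However, they currently lack the ability to teach themselves new skills and instead rely on training over large amounts of human-generated data.
We introduce SECToR (Self-Education via Chain-of-Thought Reasoning), a proof-of-concept demonstration that language models can successfully learn new skills using chain-of-thought reasoning.
Inspired by prior work in both reinforcement learning (Silver et al., 2017) and human cognition (Kahneman, 2011), SECToR first uses chain-of-thought reasoning to work through problems slowly.
SECToR then fine-tunes the model to produce the same answers, this time without chain-of-thought reasoning.
Language models trained via SECToR autonomously learn to add numbers of up to 29 digits, with no access to ground-truth examples beyond an initial supervised fine-tuning phase consisting only of numbers with 6 or fewer digits.
Our central hypothesis is that chain-of-thought reasoning can act as a policy improvement operator, analogous to how Monte Carlo Tree Search is used in AlphaZero.
We hope this work leads to new directions in which language models learn to teach themselves without the need for human demonstrations.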

Summary (original)

Large language models have astounded the world with fascinating new capabilities. However, they currently lack the ability to teach themselves new skills, relying instead on being trained on large amounts of human-generated data. We introduce SECToR (Self-Education via Chain-of-Thought Reasoning), a proof-of-concept demonstration that language models can successfully teach themselves new skills using chain-of-thought reasoning. Inspired by previous work in both reinforcement learning (Silver et al., 2017) and human cognition (Kahneman, 2011), SECToR first uses chain-of-thought reasoning to slowly think its way through problems. SECToR then fine-tunes the model to generate those same answers, this time without using chain-of-thought reasoning. Language models trained via SECToR autonomously learn to add up to 29-digit numbers without any access to any ground truth examples beyond an initial supervised fine-tuning phase consisting only of numbers with 6 or fewer digits. Our central hypothesis is that chain-of-thought reasoning can act as a policy improvement operator, analogously to how Monte-Carlo Tree Search is used in AlphaZero. We hope that this research can lead to new directions in which language models can learn to teach themselves without the need for human demonstrations.
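The two-phase loop described in the abstract (reason slowly, then train the model to give the same answer directly) can be illustrated with a toy sketch. This is not the paper's implementation: a plain lookup table stands in for the language model, an exhaustive table update stands in for fine-tuning, and the names `slow_add` and `self_train` are hypothetical. The only ground truth used is the table of one-digit sums, which plays the role of the initial supervised phase.

```python
from itertools import product


def slow_add(a, b, fast):
    """Toy 'chain-of-thought': split off the last digit and reuse
    answers the fast policy already knows for the smaller parts."""
    if (a, b) in fast:
        return fast[(a, b)]
    a_hi, a_lo = divmod(a, 10)
    b_hi, b_lo = divmod(b, 10)
    carry, digit = divmod(fast[(a_lo, b_lo)], 10)  # one-digit subproblem
    hi = fast[(a_hi, b_hi)]                        # shorter subproblem
    return 10 * (hi + carry) + digit


def self_train(max_digits):
    # Assumed supervised phase: ground-truth sums for one-digit numbers only.
    fast = {(a, b): a + b for a, b in product(range(10), repeat=2)}
    for n in range(2, max_digits + 1):
        # Slow phase: label every n-digit problem by reasoning with the
        # current fast policy (no ground truth is consulted here).
        targets = {
            (a, b): slow_add(a, b, fast)
            for a, b in product(range(10 ** n), repeat=2)
        }
        # Distillation phase: the fast policy now answers these directly.
        fast.update(targets)
    return fast


if __name__ == "__main__":
    policy = self_train(2)
    print(policy[(57, 68)])
```

Each round, the slow procedure is strictly more capable than the fast table it is built on, and distilling its answers back into the table raises the starting point for the next round. That is the sense in which the reasoning step behaves like a policy improvement operator in this toy setting. A real model generalizes imperfectly, so errors could accumulate across rounds in ways this exhaustive table cannot show.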

arXiv information

Authors	Hugh Zhang, David C. Parkes
Published	2023-09-15 17:44:17+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
