Chain-of-Thought Reasoning is a Policy Improvement Operator

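Summary

Large language models have astounded the world with fascinating new capabilities.
However, they currently lack the ability to teach themselves new skills and rely instead on large amounts of human-generated training data.
We introduce SECToR (Self-Education via Chain-of-Thought Reasoning), a proof-of-concept demonstration that language models can teach themselves new skills using chain-of-thought reasoning.
During the self-learning loop, SECToR first asks the model to solve addition problems using chain-of-thought reasoning, then trains the next version of the model to solve those same problems directly, without such reasoning.
This process often yields an improved model which, when again augmented with chain-of-thought reasoning, can solve even harder problems than the original model, allowing the self-learning loop to continue.
Language models trained via SECToR autonomously learn to add up to the longest-length-digit numbers without access to any ground-truth examples beyond an initial supervised fine-tuning phase consisting only of numbers with 6 or fewer digits.
Our central hypothesis is that chain-of-thought reasoning can act as a policy improvement operator, similar to how Monte-Carlo Tree Search is used in AlphaZero (Silver et al., 2017).
We hope that this research can lead to new directions in which language models learn to teach themselves without the need for human demonstrations.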

Abstract (original)

Large language models have astounded the world with fascinating new capabilities. However, they currently lack the ability to teach themselves new skills, relying instead on large amounts of human-generated training data. We introduce SECToR (Self-Education via Chain-of-Thought Reasoning), a proof-of-concept demonstration that language models can teach themselves new skills using chain-of-thought reasoning. During the self-learning loop, SECToR asks models to solve addition problems using chain-of-thought reasoning before training the next version of the model to solve those same problems directly without using such reasoning. This process often results in an improved model which can, when again augmented with chain-of-thought reasoning, solve even harder problems than the original model, allowing the self-learning loop to continue. Language models trained via SECToR autonomously learn to add up to the longest-length-digit numbers without access to any ground truth examples beyond an initial supervised fine-tuning phase consisting only of numbers with 6 or fewer digits. Our central hypothesis is that chain-of-thought reasoning can act as a policy improvement operator, similarly to how Monte-Carlo Tree Search is used in AlphaZero (Silver et al., 2017). We hope that this research can lead to new directions in which language models can learn to teach themselves without the need for human demonstrations.
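
The self-learning loop described above can be illustrated with a toy sketch. This is not the authors' implementation: `ToyModel`, `solve_with_cot`, `distill`, and `self_learning_loop` are hypothetical names, the "model" is simulated by a digit-length limit rather than a trained network, and the single reasoning step (peel off the lowest digits, then answer the shorter problem directly) is an assumed simplification. It only shows how a direct solver, extended by one reasoning step and then distilled, could bootstrap to longer additions without ground-truth labels.

```python
import random


class ToyModel:
    """Hypothetical stand-in for a language model with a direct-answer digit limit."""

    def __init__(self, max_direct_digits):
        self.max_direct_digits = max_direct_digits

    def solve_direct(self, a, b):
        # "No reasoning" mode: succeeds only within the learned digit range.
        if max(len(str(a)), len(str(b))) <= self.max_direct_digits:
            return a + b
        return None


def solve_with_cot(model, a, b):
    """One assumed reasoning step: add the lowest digits, then solve the rest directly."""
    carry, digit = divmod(a % 10 + b % 10, 10)
    rest = model.solve_direct(a // 10, b // 10)  # problem is one digit shorter
    if rest is None:
        return None
    return (rest + carry) * 10 + digit


def distill(dataset, old_model):
    """Toy 'training': the next model answers directly at lengths seen in the data."""
    if not dataset:
        return old_model
    longest = max(len(str(a)) for a, _, _ in dataset)
    return ToyModel(max(old_model.max_direct_digits, longest))


def self_learning_loop(model, rounds, problems_per_round=20, seed=0):
    rng = random.Random(seed)
    for _ in range(rounds):
        n = model.max_direct_digits + 1  # one digit beyond the direct range
        dataset = []
        for _ in range(problems_per_round):
            a = rng.randrange(10 ** (n - 1), 10 ** n)
            b = rng.randrange(10 ** (n - 1), 10 ** n)
            answer = solve_with_cot(model, a, b)  # label comes from the model itself
            if answer is not None:
                dataset.append((a, b, answer))
        model = distill(dataset, model)
    return model
```

Under these assumptions, `self_learning_loop(ToyModel(6), rounds=4)` returns a model whose `max_direct_digits` is 10: each round, reasoning extends the reach by one digit, and distillation folds that gain into the direct solver. A real system would also need some way to filter out incorrect self-generated answers, which this toy omits because its simulated solver never makes mistakes.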

arXiv information

Authors	Hugh Zhang, David C. Parkes
Published	2023-11-08 18:10:10+00:00
arXiv page	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google
