d1: Scaling Reasoning in Diffusion Large Language Models via Reinforcement Learning

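Summary

Recent large language models (LLMs) have demonstrated strong reasoning capabilities that benefit from online reinforcement learning (RL).
These capabilities have mainly been shown within the left-to-right autoregressive (AR) generation paradigm.
In contrast, non-autoregressive paradigms based on diffusion generate text in a coarse-to-fine manner.
Recent diffusion-based large language models (dLLMs) have achieved language modeling performance competitive with their AR counterparts, but it remains unclear whether dLLMs can leverage recent advances in LLM reasoning.
To this end, we propose d1, a framework that adapts pre-trained masked dLLMs into reasoning models through a combination of supervised finetuning (SFT) and RL.
Specifically, we develop and extend techniques to improve reasoning in pretrained dLLMs: (a) we use a masked SFT technique to distill knowledge and instill self-improvement behavior directly from existing datasets, and (b) we introduce a novel critic-free, policy-gradient-based RL algorithm called diffu-GRPO.
Through empirical studies, we investigate the performance of different post-training recipes on multiple mathematical and logical reasoning benchmarks.
We find that d1 yields the best performance and significantly improves the performance of a state-of-the-art dLLM.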
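
The abstract names a masked SFT technique but gives no details. The sketch below shows one common way masked finetuning data is built for masked diffusion models: prompt tokens stay visible, each response token is replaced by a mask symbol with some probability, and the loss would be computed only at the masked positions. The function name `mask_response`, the `MASK` symbol, and the per-token masking probability are all illustrative assumptions, not the authors' implementation.

```python
import random

MASK = "[MASK]"  # hypothetical mask symbol, chosen for illustration


def mask_response(prompt, response, mask_prob, rng):
    """Corrupt only the response part of a (prompt, response) pair.

    Returns the noisy token sequence and the indices of masked positions,
    which are the positions a model would be trained to reconstruct.
    This is an assumed recipe, not the procedure from the paper.
    """
    noisy = list(prompt)  # prompt tokens are never masked
    targets = []
    for i, tok in enumerate(response):
        if rng.random() < mask_prob:
            noisy.append(MASK)
            targets.append(len(prompt) + i)
        else:
            noisy.append(tok)
    return noisy, targets


if __name__ == "__main__":
    rng = random.Random(0)
    noisy, targets = mask_response(
        ["What", "is", "2+2", "?"], ["The", "answer", "is", "4"], 0.5, rng
    )
    print(noisy, targets)
```

Varying `mask_prob` across training examples would expose the model to both lightly and heavily corrupted responses, which loosely mirrors the coarse-to-fine generation described above.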

Abstract (original)

Recent large language models (LLMs) have demonstrated strong reasoning capabilities that benefit from online reinforcement learning (RL). These capabilities have primarily been demonstrated within the left-to-right autoregressive (AR) generation paradigm. In contrast, non-autoregressive paradigms based on diffusion generate text in a coarse-to-fine manner. Although recent diffusion-based large language models (dLLMs) have achieved competitive language modeling performance compared to their AR counterparts, it remains unclear if dLLMs can also leverage recent advances in LLM reasoning. To this end, we propose d1, a framework to adapt pre-trained masked dLLMs into reasoning models via a combination of supervised finetuning (SFT) and RL. Specifically, we develop and extend techniques to improve reasoning in pretrained dLLMs: (a) we utilize a masked SFT technique to distill knowledge and instill self-improvement behavior directly from existing datasets, and (b) we introduce a novel critic-free, policy-gradient-based RL algorithm called diffu-GRPO. Through empirical studies, we investigate the performance of different post-training recipes on multiple mathematical and logical reasoning benchmarks. We find that d1 yields the best performance and significantly improves the performance of a state-of-the-art dLLM.
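
The abstract describes diffu-GRPO as critic-free and policy-gradient based. As a rough illustration of what "critic-free" can mean in GRPO-style methods, the sketch below scores each sampled completion against the other completions for the same prompt instead of against a learned value function. The function name, the use of the population standard deviation, and the `eps` constant are assumptions made for this example; the paper's actual estimator may differ.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of completions for a single prompt.

    Each advantage is (reward - group mean) / (group std + eps), so no
    separate critic network is needed. Illustrative sketch only.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption of this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    # Hypothetical rewards: 1.0 for a correct final answer, 0.0 otherwise.
    print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))
```

Completions that beat the group average receive positive advantages and the rest receive negative ones; when every completion in a group earns the same reward, all advantages are zero and the group contributes no gradient signal.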

arXiv information

Authors	Siyan Zhao, Devaansh Gupta, Qinqing Zheng, Aditya Grover
Published	2025-04-16 16:08:45+00:00
arXiv site	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google
