A Graph-Based Synthetic Data Pipeline for Scaling High-Quality Reasoning Instructions

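Summary

Synthesizing high-quality reasoning data for continual training has proven effective at improving the performance of large language models (LLMs).
However, previous synthesis approaches struggle to scale data easily and incur high costs in pursuit of high quality.
In this paper, we propose the Graph-based Synthetic Data Pipeline (GSDP), an economical and scalable framework for synthesizing high-quality reasoning data.
Inspired by knowledge graphs, we extract knowledge points from seed data and construct a knowledge point relationship graph to explore their interconnections.
By exploring the implicit relationships among knowledge points, our method achieves a 255× data expansion.
Furthermore, GSDP, driven by open-source models, achieves synthesis quality comparable to GPT-4-0613 at 100× lower cost.
To tackle the most challenging mathematical reasoning tasks, we present the GSDP-MATH dataset, comprising over 1.91 million pairs of math problems and answers.
After fine-tuning on GSDP-MATH, GSDP-7B, based on Mistral-7B, achieves 37.7% accuracy on MATH and 78.4% on GSM8K, demonstrating the effectiveness of our method.
The dataset and models will be released at https://github.com/Jayce1kk/GSDP.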
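
The graph idea in the summary can be illustrated with a minimal sketch. The abstract does not specify how the knowledge point relationship graph is built or how implicit relationships are defined, so everything here is an assumption: the function names (`build_graph`, `implicit_pairs`) are hypothetical, explicit edges are taken to be knowledge points that co-occur in the same seed problem, and implicit relationships are taken to be pairs that share a neighbor without ever co-occurring directly. It is not the authors' implementation.

```python
from collections import defaultdict
from itertools import combinations


def build_graph(seed_points):
    """Map each knowledge point to the set of points it co-occurs with.

    Assumption: two knowledge points are linked when they were extracted
    from the same seed problem.
    """
    graph = defaultdict(set)
    for points in seed_points:
        for a, b in combinations(sorted(set(points)), 2):
            graph[a].add(b)
            graph[b].add(a)
    return dict(graph)


def implicit_pairs(graph):
    """Return pairs that share a neighbor but never co-occur directly.

    Assumption: this two-hop rule stands in for the "implicit
    relationships" mentioned in the abstract.
    """
    pairs = set()
    for neighbors in graph.values():
        for a, b in combinations(sorted(neighbors), 2):
            if b not in graph[a]:
                pairs.add((a, b))
    return sorted(pairs)


# Hypothetical knowledge points extracted from three seed problems.
seed_points = [
    ["quadratic equations", "factoring"],
    ["factoring", "prime numbers"],
    ["prime numbers", "modular arithmetic"],
]

graph = build_graph(seed_points)
explicit = sorted({tuple(sorted((a, b))) for a in graph for b in graph[a]})
implicit = implicit_pairs(graph)

print("explicit:", explicit)
print("implicit:", implicit)
```

In this toy example, three seed problems yield three explicit pairs and two additional implicit pairs. Each pair could then serve as a prompt for generating a new problem, which is one plausible way a graph can expand a seed set well beyond its original size.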

Abstract (original)

Synthesizing high-quality reasoning data for continual training has been proven to be effective in enhancing the performance of Large Language Models (LLMs). However, previous synthetic approaches struggle to easily scale up data and incur high costs in the pursuit of high quality. In this paper, we propose the Graph-based Synthetic Data Pipeline (GSDP), an economical and scalable framework for high-quality reasoning data synthesis. Inspired by knowledge graphs, we extracted knowledge points from seed data and constructed a knowledge point relationships graph to explore their interconnections. By exploring the implicit relationships among knowledge, our method achieves $\times$255 data expansion. Furthermore, GSDP led by open-source models, achieves synthesis quality comparable to GPT-4-0613 while maintaining $\times$100 lower costs. To tackle the most challenging mathematical reasoning task, we present the GSDP-MATH dataset comprising over 1.91 million pairs of math problems and answers. After fine-tuning on GSDP-MATH, GSDP-7B based on Mistral-7B achieves 37.7% accuracy on MATH and 78.4% on GSM8K, demonstrating the effectiveness of our method. The dataset and models will be released at https://github.com/Jayce1kk/GSDP.

arXiv information

Authors	Jiankang Wang, Jianjun Xu, Xiaorui Wang, Yuxin Wang, Mengting Xing, Shancheng Fang, Zhineng Chen, Hongtao Xie, Yongdong Zhang
Published	2025-04-11 05:27:08+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
