Synthesize-on-Graph: Knowledgeable Synthetic Data Generation for Continue Pre-training of Large Language Models

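Summary

Large language models (LLMs) have achieved remarkable success, yet they remain data-inefficient, particularly when learning from small, specialized corpora of limited, proprietary data. Existing synthetic data generation methods for continued pre-training focus on content within a single document and overlook knowledge associations across documents, which limits the diversity and depth of the generated content. We propose Synthetic-on-Graph (SoG), a synthetic data generation framework that incorporates cross-document knowledge associations for efficient corpus expansion. SoG builds a context graph by extracting entities and concepts from the original corpus and representing the associations between documents, then applies a graph walk strategy to sample associated knowledge. This improves the diversity and coherence of the synthetic data, so that models can learn complex knowledge structures and handle rare knowledge. To further improve synthetic data quality, we integrate Chain-of-Thought (CoT) and Contrastive Clarifying (CC) synthesis, which strengthen the reasoning process and discriminative power. In experiments, SoG outperforms the state-of-the-art (SOTA) method on a multi-hop document Q&A dataset and performs comparably to the SOTA method on reading comprehension datasets, which also highlights its generalization capability. Our work advances synthetic data generation and offers a practical solution for efficient knowledge acquisition in LLMs, especially in domains where data availability is limited.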

Summary (original)

Large Language Models (LLMs) have achieved remarkable success but remain data-inefficient, especially when learning from small, specialized corpora with limited and proprietary data. Existing synthetic data generation methods for continue pre-training focus on intra-document content and overlook cross-document knowledge associations, limiting content diversity and depth. We propose Synthetic-on-Graph (SoG), a synthetic data generation framework that incorporates cross-document knowledge associations for efficient corpus expansion. SoG constructs a context graph by extracting entities and concepts from the original corpus, representing cross-document associations, and employing a graph walk strategy for knowledge-associated sampling. This enhances synthetic data diversity and coherence, enabling models to learn complex knowledge structures and handle rare knowledge. To further improve synthetic data quality, we integrate Chain-of-Thought (CoT) and Contrastive Clarifying (CC) synthetic, enhancing reasoning processes and discriminative power. Experiments show that SoG outperforms the state-of-the-art (SOTA) method in a multi-hop document Q&A dataset while performing comparably to the SOTA method on the reading comprehension task datasets, which also underscores the better generalization capability of SoG. Our work advances synthetic data generation and provides practical solutions for efficient knowledge acquisition in LLMs, especially in domains with limited data availability.
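
The context graph and graph walk described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: it assumes entities have already been extracted per document, links two documents whenever they share an entity, and uses a plain random walk over unvisited neighbors in place of whatever walk strategy SoG actually uses. The names `build_context_graph`, `sample_walk`, and the toy `DOC_ENTITIES` data are hypothetical.

```python
import random
from collections import defaultdict


def build_context_graph(doc_entities):
    """Link documents that share at least one extracted entity.

    doc_entities maps a document id to the set of entities found in it
    (the extraction step itself is assumed to have happened already).
    Returns a dict mapping each document id to its set of neighbors.
    """
    entity_to_docs = defaultdict(set)
    for doc_id, entities in doc_entities.items():
        for entity in entities:
            entity_to_docs[entity].add(doc_id)

    neighbors = {doc_id: set() for doc_id in doc_entities}
    for docs in entity_to_docs.values():
        for doc_id in docs:
            neighbors[doc_id].update(docs - {doc_id})
    return neighbors


def sample_walk(neighbors, start, length, rng):
    """Random walk over unvisited neighbors; stops early at a dead end.

    This simple strategy is an assumption made for illustration only.
    """
    path = [start]
    visited = {start}
    while len(path) < length:
        candidates = sorted(neighbors[path[-1]] - visited)
        if not candidates:
            break
        nxt = rng.choice(candidates)
        path.append(nxt)
        visited.add(nxt)
    return path


# Toy corpus (hypothetical): d1 and d2 share "LLM", d2 and d3 share
# "pre-training", and d4 shares nothing with the others.
DOC_ENTITIES = {
    "d1": {"graph walk", "LLM"},
    "d2": {"LLM", "pre-training"},
    "d3": {"pre-training", "corpus"},
    "d4": {"unrelated"},
}

graph = build_context_graph(DOC_ENTITIES)
walk = sample_walk(graph, "d1", 3, random.Random(0))
print(walk)
```

Each sampled path groups documents connected through shared entities, and the grouped text could then be passed to a generator model as the context for one synthetic sample. How SoG weights edges, chooses start nodes, or prompts the generator is not stated in the abstract and is not modeled here.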

arXiv information

Authors	Xuhui Jiang, Shengjie Ma, Chengjin Xu, Cehao Yang, Liyu Zhang, Jian Guo
Published	2025-05-02 03:40:39+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, DeepL
