Multiverse: Your Language Models Secretly Decide How to Parallelize and Merge Generation

Summary

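Autoregressive large language models (AR-LLMs) frequently exhibit implicit parallelism in sequential generation.
Inspired by this, we introduce Multiverse, a new generative model that enables natively parallel generation.
Multiverse internalizes the MapReduce paradigm and generates automatically in three stages: (i) a Map stage for adaptive task decomposition, (ii) a Process stage for parallel subtask execution, and (iii) a Reduce stage for lossless result synthesis.
Next, we build a real-world Multiverse reasoning model with a co-design of data, algorithm, and system, enabling rapid and seamless transfer from frontier AR-LLMs.
Starting from sequential reasoning chains, we create Multiverse 1K by converting them into structured training data with an automated LLM-assisted pipeline, avoiding costly human annotation.
Algorithmically, we design Multiverse Attention to separate parallel reasoning steps while remaining compatible with causal attention for efficient training.
On the systems side, we implement Multiverse Engine to enable parallel inference.
It features a dedicated scheduler that dynamically switches between sequential and parallel generation, triggered directly by the model.
After 3 hours of fine-tuning on 1K examples, our Multiverse-32B stands as the only open-source non-AR model that achieves performance on par with leading AR-LLMs of the same scale, as evidenced by AIME24 and AIME25 scores of 54% and 46%, respectively.
Moreover, our budget-control experiments show that Multiverse-32B exhibits superior scaling, outperforming AR-LLMs by 1.87% on average at the same context length.
This scaling further translates into practical efficiency gains, achieving up to a 2x speedup across varying batch sizes.
We have open-sourced the entire Multiverse ecosystem, including data, model weights, engine, supporting tools, complete data curation prompts, and detailed training and evaluation recipes.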
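
The three-stage flow named in the summary (Map, Process, Reduce) can be illustrated with a toy sketch. This is not the Multiverse model or its engine: the function names `map_stage`, `process_stage`, `reduce_stage`, and `run`, and the arithmetic task format, are hypothetical and chosen only to show the general shape of the paradigm, with a thread pool standing in for parallel subtask execution.

```python
from concurrent.futures import ThreadPoolExecutor


def map_stage(task: str) -> list[str]:
    # Hypothetical decomposition: split the task into independent subtasks.
    return [part.strip() for part in task.split(";") if part.strip()]


def process_stage(subtask: str) -> int:
    # Hypothetical worker: evaluate one subtask (a sum of integers).
    return sum(int(token) for token in subtask.split("+"))


def reduce_stage(results: list[int]) -> int:
    # Merge the partial results into one answer.
    return sum(results)


def run(task: str) -> int:
    subtasks = map_stage(task)
    # Subtasks do not depend on each other, so they can run concurrently.
    with ThreadPoolExecutor() as pool:
        partial_results = list(pool.map(process_stage, subtasks))
    return reduce_stage(partial_results)


if __name__ == "__main__":
    print(run("1+2; 3+4; 5"))
```

In the setting the summary describes, the decomposition and the merge are decided by the model itself during generation rather than by fixed helper functions like these.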

Summary (original)

Autoregressive Large Language Models (AR-LLMs) frequently exhibit implicit parallelism in sequential generation. Inspired by this, we introduce Multiverse, a new generative model that enables natively parallel generation. Multiverse internalizes a MapReduce paradigm, generating automatically through three stages: (i) a Map stage for adaptive task decomposition, (ii) a Process stage for parallel subtask execution, and (iii) a Reduce stage for lossless result synthesis. Next, we build a real-world Multiverse reasoning model with co-design of data, algorithm, and system, enabling rapid and seamless transfer from frontier AR-LLMs. Starting from sequential reasoning chains, we create Multiverse 1K by converting them into structured training data using an automated LLM-assisted pipeline, avoiding costly human annotations. Algorithmically, we design Multiverse Attention to separate parallel reasoning steps while keeping compatibility with causal attention for efficient training. Systematically, we implement Multiverse Engine to enable parallel inference. It features a dedicated scheduler that dynamically switches between sequential and parallel generation, triggered directly by the model. After a 3-hour fine-tuning with 1K examples, our Multiverse-32B stands as the only open-sourced non-AR model achieving performance on par with leading AR-LLMs of the same scale, evidenced by AIME24 & 25 scores of 54% and 46%, respectively. Moreover, our budget control experiments show that Multiverse-32B exhibits superior scaling, outperforming AR-LLMs by 1.87% on average using the same context length. Such scaling further leads to practical efficiency gain, achieving up to 2x speedup across varying batch sizes. We have open-sourced the entire Multiverse ecosystem, including data, model weights, engine, supporting tools, as well as complete data curation prompts and detailed training and evaluation recipes.
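
The abstract does not define Multiverse Attention, so the following is only one plausible reading of "separate parallel reasoning steps while keeping compatibility with causal attention", written as a minimal sketch. It assumes a token layout of shared prefix, then parallel branches, then a shared suffix, and builds a boolean mask in which attention stays causal but tokens in different branches cannot see each other. The function `build_mask` and its parameters are hypothetical.

```python
def build_mask(prefix_len: int, branch_lens: list[int], suffix_len: int) -> list[list[bool]]:
    # Assumed layout: [shared prefix][branch 0][branch 1]...[shared suffix].
    # branch_id[i] is None for shared tokens, otherwise the branch index.
    branch_id: list[int | None] = [None] * prefix_len
    for k, length in enumerate(branch_lens):
        branch_id.extend([k] * length)
    branch_id.extend([None] * suffix_len)

    n = len(branch_id)
    mask = [[False] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):  # causal: position i only sees positions j <= i
            same_branch = branch_id[i] == branch_id[j]
            shared = branch_id[i] is None or branch_id[j] is None
            # Block attention only between tokens of two different branches.
            mask[i][j] = same_branch or shared
    return mask


if __name__ == "__main__":
    # 2 prefix tokens, two branches of 2 tokens each, 1 suffix token.
    for row in build_mask(2, [2, 2], 1):
        print("".join("x" if allowed else "." for allowed in row))
```

With a single branch the mask reduces to an ordinary lower-triangular causal mask, which is the sense in which this sketch stays compatible with causal attention.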

arXiv information

Authors	Xinyu Yang,Yuwei An,Hongyi Liu,Tianqi Chen,Beidi Chen
Published	2025-06-11 17:59:23+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
