Fast and Slow Generating: An Empirical Study on Large and Small Language Models Collaborative Decoding

Summary

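Large language models (LLMs) show strong capabilities across a wide range of applications, but they face major challenges such as high inference latency, substantial training costs, and hallucination.
Collaborative decoding between large and small language models (SLMs) is a promising strategy for mitigating these problems, through methods such as speculative decoding, contrastive decoding, and emulator or proxy fine-tuning.
However, the details of such collaboration, especially from a unified perspective, remain largely unexplored.
Inspired by dual-process cognitive theory, we propose a unified framework in this paper called Fast and Slow Generating (FS-GEN).
Within this framework, LLMs (sometimes used together with SLMs) are classified as System 2 (slow and deliberate), and standalone SLMs are designated as System 1 (fast and intuitive).
We provide a comprehensive analysis of these collaborative methods, identify their common properties, and use the FS-GEN framework to characterize the difference in knowledge capability between System 2 and System 1.
Our findings indicate that, across the methods studied, only a small fraction of collaborative interactions (under about 20% in most cases) is necessary.
These interactions between System 1 and System 2 follow a scaling law related to the parameter ratio, which makes the collaboration predictable.
We also examine the specific conditions under which collaboration is most effective, particularly from the perspective of uncertainty, offering new insights that may guide future optimization work.
Our study emphasizes that the fundamental difference between System 1 and System 2 lies in the uncertainty of next-token prediction, and that intervention by System 2 is important for supporting System 1. Code for reproduction: https://github.com/TsinghuaC3I/FS-GEN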

Abstract (original)

Large Language Models (LLMs) exhibit impressive capabilities across various applications but encounter substantial challenges such as high inference latency, considerable training costs, and the generation of hallucinations. Collaborative decoding between large and small language models (SLMs) presents a promising strategy to mitigate these issues through methods including speculative decoding, contrastive decoding, and emulator or proxy fine-tuning. However, the specifics of such collaborations, particularly from a unified perspective, remain largely unexplored. Inspired by dual-process cognitive theory, we propose a unified framework in this paper, termed Fast and Slow Generating (FS-GEN). Within this framework, LLMs (sometimes along with SLMs) are categorized as System 2 (slow and deliberate), while independent SLMs are designated as System 1 (fast and intuitive). We provide a comprehensive analysis of these collaborative methodologies, elucidating their common properties and shedding light on the differential knowledge capabilities of System 2 versus System 1 through the FS-GEN framework. Our findings indicate that only a small proportion of collaborative interactions (approximately less than 20% in most instances) are necessary across various methods. These interactions between System 1 and System 2 conform to a scaling law related to the parameter ratios, enabling predictable collaboration. Furthermore, we explore the specific conditions under which collaboration proves most effective, particularly from an uncertainty perspective, offering novel insights that may guide future optimization efforts. Our research underscores that the fundamental distinction between System 1 and System 2 lies in the uncertainty of next token predictions, where interventions by System 2 are crucial to support System 1. Code for Reproduction: https://github.com/TsinghuaC3I/FS-GEN
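
The abstract locates the difference between System 1 and System 2 in next-token uncertainty. One simple way to picture that idea is an entropy gate: the small model decodes on its own and defers to the large model only at steps where its own distribution is too flat. The sketch below is an illustration under stated assumptions, not the authors' method or the code in their repository. The function names, the entropy threshold, and the toy probability tables are all hypothetical, and the resulting intervention rate says nothing about the paper's measurements.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of a discrete probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def collaborate(slm_dists, llm_dists, threshold):
    """Greedy decoding with a hypothetical entropy gate.

    At each step the small model's (System 1) distribution is used unless
    its entropy exceeds `threshold`, in which case the large model's
    (System 2) distribution is used instead. Returns the chosen token ids
    and the fraction of steps where the large model intervened.
    """
    tokens, interventions = [], 0
    for slm, llm in zip(slm_dists, llm_dists):
        if entropy(slm) > threshold:
            dist = llm
            interventions += 1
        else:
            dist = slm
        # Greedy choice: index of the highest-probability token.
        tokens.append(max(range(len(dist)), key=dist.__getitem__))
    return tokens, interventions / len(tokens)


# Made-up per-step distributions over a 3-token vocabulary.
slm_dists = [
    [0.90, 0.05, 0.05],
    [0.05, 0.90, 0.05],
    [0.40, 0.35, 0.25],  # the small model is uncertain here
    [0.05, 0.05, 0.90],
    [0.90, 0.05, 0.05],
]
llm_dists = [
    [0.90, 0.05, 0.05],
    [0.05, 0.90, 0.05],
    [0.10, 0.10, 0.80],
    [0.05, 0.05, 0.90],
    [0.90, 0.05, 0.05],
]

tokens, rate = collaborate(slm_dists, llm_dists, threshold=0.8)
print(tokens, rate)
```

In this toy run only the third step crosses the threshold, so the large model is consulted once in five steps. A real system would obtain both distributions from model forward passes and would need to choose the gate from data.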

arXiv information

Authors	Kaiyan Zhang, Jianyu Wang, Ning Ding, Biqing Qi, Ermo Hua, Xingtai Lv, Bowen Zhou
Published	2024-10-23 15:23:00+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
