Token Level Routing Inference System for Edge Devices

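Summary

The computational complexity of large language model (LLM) inference significantly limits how efficiently such models can be deployed on edge devices.
Small language models, by contrast, decode faster and consume fewer resources, but their responses are often lower in quality and more prone to hallucination.
Collaborative decoding, in which a large model assists in generating critical tokens, has emerged as a promising way to address this trade-off.
This paradigm draws on the strengths of both model types: selective intervention by the large model enables high-quality inference, while the speed and efficiency of the small model are preserved.
This work presents a new collaborative decoding inference system in which a small model runs inference on-device and selectively consults a cloud-based large model to generate critical tokens.
Remarkably, the system achieves a 60% performance gain on CommonsenseQA using only a 0.5B model on an M1 MacBook, with less than 7% of token generation uploaded to the large model in the cloud.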

Abstract (original)

The computational complexity of large language model (LLM) inference significantly constrains their deployment efficiency on edge devices. In contrast, small language models offer faster decoding and lower resource consumption but often suffer from degraded response quality and heightened susceptibility to hallucinations. To address this trade-off, collaborative decoding, in which a large model assists in generating critical tokens, has emerged as a promising solution. This paradigm leverages the strengths of both model types by enabling high-quality inference through selective intervention of the large model, while maintaining the speed and efficiency of the smaller model. In this work, we present a novel collaborative decoding inference system that allows small models to perform on-device inference while selectively consulting a cloud-based large model for critical token generation. Remarkably, the system achieves a 60% performance gain on CommonsenseQA using only a 0.5B model on an M1 MacBook, with under 7% of tokens generation uploaded to the large model in the cloud.
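The routing idea in the abstract can be illustrated with a small sketch. The abstract does not say how critical tokens are identified, so the confidence-threshold rule below is an assumption made purely for illustration, not the authors' router. The names `route_decode`, `toy_small`, and `toy_large` are hypothetical, and the two "models" are hard-coded stubs standing in for an on-device small model and a cloud-hosted large model.

```python
from typing import Callable, List, Tuple

# Hypothetical interface: a model maps a token prefix to (next_token, confidence).
Model = Callable[[List[str]], Tuple[str, float]]


def route_decode(small: Model, large: Model, prompt: List[str],
                 max_new_tokens: int = 16, threshold: float = 0.5,
                 eos: str = "<eos>") -> Tuple[List[str], float]:
    """Decode with the small model, deferring low-confidence tokens to the large one.

    Returns the generated tokens and the fraction of them produced by the large model.
    The threshold rule is an assumed routing criterion, not the paper's.
    """
    tokens = list(prompt)
    routed = 0
    generated = 0
    for _ in range(max_new_tokens):
        token, confidence = small(tokens)
        if confidence < threshold:
            # Treat this position as "critical" and ask the large model instead.
            token, _ = large(tokens)
            routed += 1
        tokens.append(token)
        generated += 1
        if token == eos:
            break
    ratio = routed / generated if generated else 0.0
    return tokens[len(prompt):], ratio


def toy_small(prefix: List[str]) -> Tuple[str, float]:
    # Stub small model: unsure (and wrong) about the word that follows "is".
    if prefix[-1] == "is":
        return "green", 0.2
    script = {"the": "sky", "sky": "is", "blue": "<eos>"}
    return script.get(prefix[-1], "<eos>"), 0.9


def toy_large(prefix: List[str]) -> Tuple[str, float]:
    # Stub large model: only consulted for the uncertain position.
    return ("blue", 0.99) if prefix[-1] == "is" else ("<eos>", 0.99)


if __name__ == "__main__":
    output, ratio = route_decode(toy_small, toy_large, ["the"])
    print(output, ratio)
```

In this toy run only one of the four generated tokens is sent to the large model. Setting `threshold=0.0` disables routing entirely, so the small model's low-confidence guess is kept.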

arXiv information

Authors	Jianshu She, Wenhao Zheng, Zhengzhong Liu, Hongyi Wang, Eric Xing, Huaxiu Yao, Qirong Ho
Published	2025-04-10 15:54:19+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
