Hogwild! Inference: Parallel LLM Generation via Concurrent Attention

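Summary

Large language models (LLMs) have demonstrated the ability to tackle increasingly complex tasks through advanced reasoning, long-form content generation, and tool use.
Solving these tasks often requires long inference-time computation.
In human problem solving, a common strategy for speeding up work is collaboration: dividing the problem into sub-tasks, exploring different strategies at the same time, and so on. Recent research has shown that LLMs can also work in parallel by implementing explicit cooperation frameworks, such as voting mechanisms or the explicit creation of independent sub-tasks that can be executed in parallel.
However, each of these frameworks may not suit every type of task, which can limit its applicability.
This work proposes a different design approach: run LLM "workers" in parallel, let them synchronize through a concurrently updated attention cache, and prompt the workers to decide how best to collaborate.
This approach lets the instances devise their own collaboration strategy for the problem at hand.
The approach is implemented as Hogwild! Inference: a parallel LLM inference engine in which multiple instances of the same LLM run in parallel with the same attention cache, with "instant" access to each other's generated tokens.
Hogwild! Inference uses Rotary Position Embeddings (RoPE) to avoid recomputation while improving parallel hardware utilization.
The authors find that modern reasoning-capable LLMs can perform inference with a shared key-value cache out of the box, without additional fine-tuning.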

Abstract (original)

Large Language Models (LLMs) have demonstrated the ability to tackle increasingly complex tasks through advanced reasoning, long-form content generation, and tool use. Solving these tasks often involves long inference-time computations. In human problem solving, a common strategy to expedite work is collaboration: by dividing the problem into sub-tasks, exploring different strategies concurrently, etc. Recent research has shown that LLMs can also operate in parallel by implementing explicit cooperation frameworks, such as voting mechanisms or the explicit creation of independent sub-tasks that can be executed in parallel. However, each of these frameworks may not be suitable for all types of tasks, which can hinder their applicability. In this work, we propose a different design approach: we run LLM ‘workers’ in parallel, allowing them to synchronize via a concurrently-updated attention cache and prompt these workers to decide how best to collaborate. Our approach allows the instances to come up with their own collaboration strategy for the problem at hand, all the while ‘seeing’ each other’s partial progress in the concurrent cache. We implement this approach via Hogwild! Inference: a parallel LLM inference engine where multiple instances of the same LLM run in parallel with the same attention cache, with ‘instant’ access to each other’s generated tokens. Hogwild! Inference takes advantage of Rotary Position Embeddings (RoPE) to avoid recomputation while improving parallel hardware utilization. We find that modern reasoning-capable LLMs can perform inference with shared Key-Value cache out of the box, without additional fine-tuning.
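
The abstract does not spell out how RoPE helps avoid recomputation, so the following is only an illustrative sketch of the general RoPE property that plausibly underlies it, not the authors' implementation. With rotary embeddings, an attention score depends only on the relative offset between query and key positions, and a key already rotated for one position can be moved to another position with one extra rotation instead of being recomputed from scratch. The vectors, positions, and helper names (`rope`, `dot`) are made up for the example.

```python
import math

def rope(vec, pos, base=10000.0):
    """Rotate each consecutive pair of an even-length vector by a position-dependent angle."""
    d = len(vec)
    out = []
    for i in range(0, d, 2):
        theta = pos * base ** (-i / d)  # per-pair frequency, as in standard RoPE
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# Hypothetical query and key vectors (illustrative values only).
q = [0.3, -1.2, 0.8, 0.5]
k = [1.0, 0.4, -0.7, 0.2]

# Same relative offset (3) at two different absolute positions.
score_a = dot(rope(q, 5), rope(k, 2))
score_b = dot(rope(q, 105), rope(k, 102))

# A key cached at position 2 can be shifted to position 102 by rotating it again.
shifted = rope(rope(k, 2), 100)
direct = rope(k, 102)

print(math.isclose(score_a, score_b, abs_tol=1e-9))
print(all(math.isclose(a, b, abs_tol=1e-9) for a, b in zip(shifted, direct)))
```

Both checks should print `True` up to floating-point tolerance. In a shared-cache setting, this kind of property would let tokens written by one worker be viewed at different positions by another worker without rerunning the model over them; how Hogwild! Inference actually arranges the cache is described in the paper, not here.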

arXiv information

Authors	Gleb Rodionov, Roman Garipov, Alina Shutova, George Yakushev, Vage Egiazarian, Anton Sinitsin, Denis Kuznedelev, Dan Alistarh
Published	2025-04-09 17:56:08+00:00
arXiv site	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google
