Deliberation in Latent Space via Differentiable Cache Augmentation

Summary

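Techniques that let large language models (LLMs) "think more" by generating and attending to intermediate reasoning steps have shown promise for solving complex problems.
However, standard approaches generate a sequence of discrete tokens immediately before responding, which can incur significant latency and makes them difficult to optimize.
This work shows that a frozen LLM can be augmented with an offline coprocessor that operates on the model's key-value (kv) cache.
The coprocessor augments the cache with a set of latent embeddings designed to improve the fidelity of subsequent decoding.
The coprocessor is trained on standard pretraining data using the language modeling loss from the decoder, while the decoder itself stays frozen.
This lets the model learn, in an end-to-end differentiable fashion, how to distill additional computation into its kv-cache.
Because the decoder is unchanged, the coprocessor can run offline and asynchronously, and the language model still functions normally when the coprocessor is unavailable or when a given cache is deemed not to need extra computation.
Experiments show that, when a cache is augmented, the decoder achieves lower perplexity on numerous subsequent tokens.
Furthermore, even without task-specific training, cache augmentation consistently reduces perplexity and improves performance across a range of reasoning-intensive tasks.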
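
The data flow described above can be illustrated with a toy sketch. Everything in it is an assumption made for illustration and not the authors' implementation: the single-head `attend` function stands in for the frozen decoder, `coprocessor` is a hypothetical two-parameter-per-latent map from the cache to extra (key, value) pairs, and finite differences stand in for the automatic differentiation a real system would use. Only the coprocessor parameters are updated, and the decoder works unchanged on a cache that was never augmented.

```python
import math
import random


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def attend(query, cache):
    """Frozen decoder attention over a list of (key, value) pairs."""
    scores = [dot(query, key) for key, _ in cache]
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    dim = len(cache[0][1])
    return [sum(e * value[d] for e, (_, value) in zip(exps, cache)) / total
            for d in range(dim)]


def coprocessor(cache, params):
    """Hypothetical coprocessor: each latent (key, value) pair is a learned
    rescaling (a, b) of the mean key and mean value in the cache."""
    dim, n = len(cache[0][0]), len(cache)
    mean_key = [sum(key[d] for key, _ in cache) / n for d in range(dim)]
    mean_val = [sum(val[d] for _, val in cache) / n for d in range(dim)]
    return [([a * m for m in mean_key], [b * m for m in mean_val])
            for a, b in params]


def augment(cache, params):
    """Append latent entries; returns a new list, the input cache is untouched."""
    return cache + coprocessor(cache, params)


def lm_loss(query, cache, readout, target):
    """Negative log-likelihood of `target` under the frozen decoder."""
    hidden = attend(query, cache)
    logits = [dot(row, hidden) for row in readout]
    peak = max(logits)
    log_norm = peak + math.log(sum(math.exp(l - peak) for l in logits))
    return log_norm - logits[target]


def train(query, cache, readout, target, params, steps=30, lr=0.2, eps=1e-4):
    """Update only the coprocessor parameters. Finite differences stand in
    for backpropagation, and a step is kept only if it lowers the loss."""
    def loss_of(p):
        return lm_loss(query, augment(cache, p), readout, target)

    params = [list(p) for p in params]
    for _ in range(steps):
        base = loss_of(params)
        candidate = [list(p) for p in params]
        for i in range(len(params)):
            for j in range(2):
                bumped = [list(p) for p in params]
                bumped[i][j] += eps
                grad = (loss_of(bumped) - base) / eps
                candidate[i][j] -= lr * grad
        if loss_of(candidate) < base:
            params = candidate
        else:
            lr *= 0.5  # step too large: shrink and retry
    return params


random.seed(0)
DIM, VOCAB, NUM_LATENTS = 4, 5, 2


def rand_vec():
    return [random.gauss(0.0, 1.0) for _ in range(DIM)]


cache = [(rand_vec(), rand_vec()) for _ in range(3)]   # stand-in kv-cache
query = rand_vec()
readout = [rand_vec() for _ in range(VOCAB)]           # frozen output layer
target = 1                                             # index of the "next token"

init_params = [[0.1, 0.1] for _ in range(NUM_LATENTS)]
plain_loss = lm_loss(query, cache, readout, target)    # coprocessor absent
initial_loss = lm_loss(query, augment(cache, init_params), readout, target)
trained_params = train(query, cache, readout, target, init_params)
trained_loss = lm_loss(query, augment(cache, trained_params), readout, target)

print(f"no coprocessor:        {plain_loss:.4f}")
print(f"untrained coprocessor: {initial_loss:.4f}")
print(f"trained coprocessor:   {trained_loss:.4f}")
```

Because `augment` returns a new list, the original cache stays valid, which mirrors the claim that the language model can function normally when the coprocessor is unavailable. The toy fits a single target token on random data, so the printed losses say nothing about the results reported in the paper.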

Abstract (original)

Techniques enabling large language models (LLMs) to ‘think more’ by generating and attending to intermediate reasoning steps have shown promise in solving complex problems. However, the standard approaches generate sequences of discrete tokens immediately before responding, and so they can incur significant latency costs and be challenging to optimize. In this work, we demonstrate that a frozen LLM can be augmented with an offline coprocessor that operates on the model’s key-value (kv) cache. This coprocessor augments the cache with a set of latent embeddings designed to improve the fidelity of subsequent decoding. We train this coprocessor using the language modeling loss from the decoder on standard pretraining data, while keeping the decoder itself frozen. This approach enables the model to learn, in an end-to-end differentiable fashion, how to distill additional computation into its kv-cache. Because the decoder remains unchanged, the coprocessor can operate offline and asynchronously, and the language model can function normally if the coprocessor is unavailable or if a given cache is deemed not to require extra computation. We show experimentally that when a cache is augmented, the decoder achieves lower perplexity on numerous subsequent tokens. Furthermore, even without any task-specific training, our experiments demonstrate that cache augmentation consistently reduces perplexity and improves performance across a range of reasoning-intensive tasks.
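
The abstract gives no formula, but the training setup it describes can be written down under some assumed notation (none of these symbols come from the source): let $C_t$ be the kv-cache the frozen decoder with parameters $\phi$ builds for a context $x_{1:t}$, let $f_\theta$ be the coprocessor, let $\oplus$ denote appending the latent embeddings to the cache, and let $k$ be the number of subsequent tokens scored. One plausible form of the objective is then:

```latex
\theta^{*} = \arg\min_{\theta}\;
  \mathbb{E}\!\left[ -\sum_{i=1}^{k}
  \log p_{\phi}\!\left( x_{t+i} \,\middle|\, x_{t+1:t+i-1},\; C_t \oplus f_{\theta}(C_t) \right) \right],
\qquad \phi \text{ held fixed.}
```

Perplexity over those $k$ tokens is the exponential of the average per-token negative log-likelihood, so lowering this loss is the same as lowering the perplexity the abstract reports.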

arXiv information

Authors	Luyang Liu, Jonas Pfeiffer, Jiaxing Wu, Jun Xie, Arthur Szlam
Published	2024-12-23 18:02:25+00:00
arXiv site	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google
