Abstract
Techniques that enable large language models (LLMs) to “think more” by generating and attending to intermediate reasoning steps have shown promise in addressing complex tasks. However, standard methods often rely on sequentially producing streams of discrete tokens, which can create inefficiencies and complicate end-to-end optimization. In this work, we showcase how a frozen LLM can be augmented with an offline coprocessor that operates on the model’s key-value (kv) cache, improving subsequent decoding fidelity without direct alterations to the base model. We train this coprocessor on standard pretraining corpora (without reinforcement learning or specialized procedures), which allows it to learn how to inject additional computation back into the kv-cache in an end-to-end differentiable manner. Because our base decoder remains untouched, the coprocessor can run asynchronously, and the LLM still functions normally in the absence of the augmentation. Empirically, cache augmentation reduces perplexity on subsequent tokens and consistently enhances performance on a variety of reasoning-intensive benchmarks, such as GSM8K and MMLU. We emphasize that this approach realizes a new paradigm of deliberative reasoning in latent space—enabling the LLM to refine internal representations without imposing the overhead of generating step-by-step textual rationales at inference time.
Keywords: Latent Reasoning, Cache Augmentation, LLM, Coprocessor
1. Introduction
A substantial body of work highlights that large language models (LLMs) benefit from mechanisms allowing more extensive reasoning. Strategies such as Chain-of-Thought (Wei et al., 2022), Zero-shot CoT (Kojima et al., 2022), and search-based approaches (Wu et al., 2024) show that producing or searching over intermediate stepwise rationales improves accuracy on complex tasks. Similarly, the capacity to adapt computational budgets, allocating more or less “thinking” as required, has been explored for efficiency (Schuster et al., 2022). Yet these strategies share a common pattern: they rely on the LLM writing out discrete intermediate tokens (either explicitly or internally), typically in real time. This complicates large-scale training, since discrete outputs are not readily amenable to gradient-based optimization, and it adds latency when many reasoning tokens must be generated at inference.
Here, we propose a radically different pathway: we augment a frozen LLM with an external coprocessor that deliberates in latent space by directly operating on the model’s kv-cache. Rather than generating textual “thoughts,” this coprocessor synthesizes new latent embeddings that get inserted into the kv-cache. Because these embeddings live in the continuous space of the model’s internal states, they can be trained end-to-end using a standard language modeling loss—without requiring changes to the decoder. The language model then uses these latent embeddings for subsequent generation as though they were part of the context, but no discrete tokens need to be produced.
This mechanism is conceptually inspired by kv-cache compression (Ge et al., 2024; Mu et al., 2024), where external modules manipulate or compress the kv-cache for better efficiency. However, in our scenario, the emphasis is not on compression but on deliberative enhancement. The LLM’s memory (the kv-cache) is effectively “expanded” with new vectors that help the model produce more accurate predictions for the tokens that follow. We train the coprocessor on the same pretraining data used to train the LLM but keep the LLM weights frozen. The coprocessor is thus forced to learn to produce beneficial latent augmentations under the standard cross-entropy objective.
Our approach brings notable advantages:
- End-to-End Differentiability: The entire pipeline, coprocessor plus decoder, can be trained end-to-end by backpropagating through the kv-cache manipulations, yet the decoder’s parameters remain untouched. This enables straightforward optimization without complex discrete RL steps.
- Asynchronous Operation: Because the base transformer is not updated, the coprocessor can run asynchronously, offline, or even on specialized hardware. This is a sharp contrast to typical CoT methods, which must produce reasoning tokens inline during inference. Moreover, if the coprocessor is unavailable, the frozen LLM simply reverts to standard behavior.
- Improved Reasoning: Experimental results reveal improved perplexity and strong gains on reasoning-heavy tasks. Remarkably, the effect extends multiple tokens beyond the injection point, suggesting that these latent embeddings effectively influence the model’s subsequent chain of generation.
We provide extensive empirical evaluations on the Gemma-2 (Team-Gemma et al., 2024) 2B-parameter LLM, demonstrating consistent perplexity reductions and performance boosts on benchmarks such as GSM8K, MMLU, DROP, ARC, and more. These gains typically increase with the number of latent embeddings we insert—indicating that more extensive latent “thinking” reaps better outcomes.
The remainder of this article is organized as follows. Section 2 carefully describes our method, including how the coprocessor is trained, how kv-cache augmentation is performed, and how we structure the training data. Section 3 then presents empirical results. Section 4 discusses related work on chain-of-thought, latent space reasoning, kv-cache manipulation, external modules, and hypernetworks. Finally, Section 5 summarizes our conclusions and future directions.
2. Methodology
2.1. Problem Statement
We adopt a frozen LLM parameterized by $\theta$. Given an input sequence $x$ and a target output $y$, our goal is to learn an auxiliary coprocessor $\phi$. During inference, the LLM processes $x$ and produces a kv-cache $(K_{1:m}, V_{1:m})$. The coprocessor $\phi$ then reads this kv-cache and generates latent embeddings $z$:

$$\phi\big(K_{1:m}, V_{1:m}\big) \;\longrightarrow\; z.$$

We then append $z$ to the kv-cache, effectively augmenting $(K_{1:m}, V_{1:m})$ to $(K^{*}_{1:m}, V^{*}_{1:m})$. Finally, the frozen LLM decodes from this augmented cache to predict $y$. The training objective is the standard language modeling likelihood:

$$\max_{\phi} \; \mathbb{E}\big[\log P_{\theta}(y \mid x, z)\big],$$

where only $\phi$ is updated, while $\theta$ remains untouched.
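As a minimal sketch of this objective (toy stand-in modules only; the real $\theta$ is the frozen Gemma-2 decoder and $\phi$ a full transformer coprocessor, and the quadratic loss below merely stands in for the language modeling loss), the key point is that only $\phi$'s parameters are handed to the optimizer, while gradients still flow through the frozen decoder's computation into $z$:

```python
import torch
from torch import nn

# Toy stand-ins: theta is the frozen decoder, phi the trainable coprocessor.
decoder_theta = nn.Linear(32, 32)
coprocessor_phi = nn.Linear(32, 32)

for p in decoder_theta.parameters():
    p.requires_grad_(False)              # theta is never updated

optimizer = torch.optim.AdamW(coprocessor_phi.parameters(), lr=1e-4)

cache_summary = torch.randn(8, 32)       # stands in for (K_{1:m}, V_{1:m})
z = coprocessor_phi(cache_summary)       # latent embeddings, a function of phi
output = decoder_theta(z)                # the frozen decoder consumes z
loss = output.pow(2).mean()              # stands in for -log P_theta(y | x, z)
loss.backward()                          # gradients pass through theta into phi
optimizer.step()
```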
2.2. Model Architecture
Concretely, we instantiate the coprocessor with the same architecture as the pretrained LLM (i.e., a Transformer of similar size), but only the coprocessor’s parameters are finetuned. Figure 1 illustrates the flow:
- KV-Cache Generation: We first feed the input sequence $x$ into the frozen LLM, obtaining the kv-cache $(K_{1:m}, V_{1:m})$.
- Augmentation: This kv-cache is then passed to the coprocessor $\phi$. Simultaneously, we provide $\phi$ with a handful of trainable soft tokens, continuous embeddings that the coprocessor can treat as input prompts. The coprocessor merges the original kv-cache with these extra embeddings to produce the latent embeddings $z$.
- Decoder Generation: We append $z$ to the kv-cache. The LLM now sees an expanded memory; when continuing to decode tokens, it has “awareness” of these newly introduced latent embeddings.
Because the LLM itself is frozen, the synergy between $\phi$ and the LLM must be established solely through adjusting $\phi$. Through backpropagation of the language modeling loss, $\phi$ gradually learns how best to fill in these latent embeddings so that, for subsequent tokens, the LLM’s predictions are more accurate.
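To make the three stages concrete, here is a minimal, self-contained PyTorch sketch of the wiring. It is an illustration under simplifying assumptions, not the paper’s implementation: the decoder stand-in is a single attention layer with an explicit kv-cache, and the coprocessor is a small MLP over a pooled cache, whereas in the paper the coprocessor shares the full Gemma-2 transformer architecture. Class names and dimensions are hypothetical.

```python
import torch
from torch import nn
import torch.nn.functional as F

D, N_LATENT = 64, 8   # hidden size; latent embeddings emitted per coprocessor call

class TinyFrozenDecoder(nn.Module):
    """Toy stand-in for the frozen LLM: one attention layer with an explicit kv-cache."""
    def __init__(self, vocab=100):
        super().__init__()
        self.emb = nn.Embedding(vocab, D)
        self.q = nn.Linear(D, D)
        self.k = nn.Linear(D, D)
        self.v = nn.Linear(D, D)
        self.out = nn.Linear(D, vocab)

    def prefill(self, ids):
        """Step 1: process the input x and return its kv-cache (K_{1:m}, V_{1:m})."""
        h = self.emb(ids)
        return self.k(h), self.v(h)

    def decode_step(self, ids, kv):
        """Next-token logits given an (optionally augmented) kv-cache."""
        K, V = kv
        q = self.q(self.emb(ids))
        attn = F.softmax(q @ K.transpose(-2, -1) / D ** 0.5, dim=-1)
        return self.out(attn @ V)

class Coprocessor(nn.Module):
    """Reads the kv-cache plus trainable soft tokens and emits latent key/value pairs."""
    def __init__(self, n_soft=4):
        super().__init__()
        self.soft = nn.Parameter(torch.randn(n_soft, D) * 0.02)   # trainable soft tokens
        self.net = nn.Sequential(nn.Linear(3 * D, 4 * D), nn.GELU(),
                                 nn.Linear(4 * D, 2 * N_LATENT * D))

    def forward(self, kv):
        K, V = kv                                                 # each (m, D)
        # Mean pooling is a crude stand-in for the paper's full transformer coprocessor.
        pooled = torch.cat([K.mean(0), V.mean(0), self.soft.mean(0)])
        latents = self.net(pooled).view(2, N_LATENT, D)
        return latents[0], latents[1]                             # latent (k_z, v_z)

decoder, coproc = TinyFrozenDecoder(), Coprocessor()
for p in decoder.parameters():
    p.requires_grad_(False)                                       # the base LLM stays frozen

x = torch.randint(0, 100, (6,))                                   # toy input token ids
K, V = decoder.prefill(x)                                         # 1) kv-cache generation
k_z, v_z = coproc((K, V))                                         # 2) cache augmentation
K_aug, V_aug = torch.cat([K, k_z]), torch.cat([V, v_z])
logits = decoder.decode_step(x[-1:], (K_aug, V_aug))              # 3) decode with awareness of z
```

A cross-entropy loss on tokens decoded after the augmentation would backpropagate through `decode_step` into `coproc` only, since the decoder’s parameters are frozen.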
2.3. Pretraining Setup: Multi-Position Augmentation
To train $\phi$ in a scalable way, we collect long sequences from the LLM’s original pretraining corpus (2 trillion tokens for Gemma-2) and randomly select multiple positions at which to insert latent embeddings. As illustrated in Figure 2, suppose we have the text “a b c d e f”.
We select positions, say after token “b” and after token “d”, and insert a certain number of latent embeddings there (e.g., $b', b''$ and $d', d''$). We then train the model to predict a set of upcoming tokens (the “ahead tokens”). For instance, for the insertion after “b,” we might train it to predict “c” and “d.” Simultaneously, for the insertion after “d,” we train it to predict “e” and “f.” We apply a custom attention mask so that each set of latent embeddings only sees the tokens preceding it, not the tokens after. This entire structure is processed in a single forward pass, enabling efficient training.
We choose hyperparameters:
- Number of Latent Embeddings $\ell$: the number of embeddings inserted at each augmentation point.
- Number of Ahead Tokens $a$: how many future tokens we want the coprocessor to help predict.
By training in this teacher-forcing style across a massive corpus, the coprocessor learns a general ability: to read partial contexts from the kv-cache, generate helpful latent embeddings, and thus improve the LLM’s future predictions. This yields a robust mechanism for “latent thinking” that does not rely on discrete token generation.
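The sketch below shows one way such a training example could be assembled. It is an illustrative assumption rather than the authors’ data pipeline, and it omits the extra query rows needed to predict the ahead tokens conditioned on each latent block; it focuses on position sampling, target selection, and the masking rule that lets latent slots see only their prefix.

```python
import random
import torch

def build_augmented_example(tokens, num_latents=2, ahead=2, positions=None):
    """Sample augmentation positions and build their targets and attention mask.

    Layout: the original tokens occupy slots [0, n); the latent slots for each
    augmentation point are appended at the end of the sequence but masked as if
    they were inserted right after their chosen position.
    """
    n = len(tokens)
    if positions is None:                                 # random positions, as in pretraining
        positions = sorted(random.sample(range(1, n - ahead), 2))

    total = n + len(positions) * num_latents
    mask = torch.zeros(total, total, dtype=torch.bool)
    mask[:n, :n] = torch.ones(n, n).tril().bool()         # ordinary causal mask over real tokens

    targets = []                                          # (latent-slot range, ahead-token targets)
    for i, pos in enumerate(positions):
        start = n + i * num_latents
        end = start + num_latents
        # Latent slots see only the prefix up to their insertion point...
        mask[start:end, :pos + 1] = True
        # ...and attend to each other causally, never to later text or other blocks.
        mask[start:end, start:end] = torch.ones(num_latents, num_latents).tril().bool()
        # The `ahead` tokens following the insertion point are the prediction targets.
        targets.append((slice(start, end), tokens[pos + 1: pos + 1 + ahead]))
    return mask, targets

mask, targets = build_augmented_example(list("abcdef"), positions=[1, 3])
# targets -> [(slice(6, 8), ['c', 'd']), (slice(8, 10), ['e', 'f'])]
```

With insertions after “b” (index 1) and “d” (index 3), the targets are [“c”, “d”] and [“e”, “f”], matching the example above, and many such examples are packed into a single forward pass.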
3. Experiments
We implement our approach on the Gemma-2 2B model (Team-Gemma et al., 2024), a decoder-only Transformer pretrained on 2 trillion tokens of English-dominant text (web, code, etc.). During coprocessor training:
- The base LLM is frozen.
- The coprocessor is trained on the same pretraining corpus for 100k steps.
- We typically use 16 ahead tokens ($a = 16$) and up to 128 random augmentation positions per sequence.
Below, we present perplexity evaluations and results on public benchmarks. Crucially, no additional task-specific fine-tuning is done for these benchmarks—only a straightforward single call to the coprocessor at the end of each prompt to produce the latent embeddings.
3.1. Perplexity Evaluation
First, we measure how the augmented model performs on standard next-token prediction. Specifically, for each evaluated position in a validation sequence, we insert $\ell$ latent embeddings and then measure perplexity on the $k$-th token after the insertion (for $k = 1, 2, \ldots, 32$). Figure 3 shows the training curves for the 1st and 32nd tokens after augmentation. Across $\ell$ values of 8, 16, 32, and 64, perplexity consistently drops below the frozen-LLM baseline, and larger $\ell$ yields greater improvements.
Table 1 shows the relative perplexity reductions at multiple positions up to the 32nd token after the insertion. The improvement remains visible even at position 32, though its magnitude diminishes with distance. Nonetheless, these results confirm that the coprocessor’s embeddings meaningfully alter the LLM’s internal representations, leading to more accurate predictions well beyond the immediate next token.
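A minimal sketch of this evaluation protocol follows. It assumes access to per-token log-probabilities of the ground-truth tokens from the augmented model; the function names and input layout are illustrative, not the paper’s evaluation code.

```python
import math
import torch

def perplexity_at_offset(token_logprobs, insert_pos, k):
    """Perplexity contribution of the k-th token after one augmentation point.

    `token_logprobs[i]` is the log-probability of the true token at position i,
    computed with latent embeddings inserted right after `insert_pos`.
    """
    return math.exp(-token_logprobs[insert_pos + k].item())

def mean_perplexity_at_offset(all_logprobs, insert_positions, k):
    """Corpus-level perplexity at offset k, averaged over many insertion points."""
    nlls = [-lp[pos + k] for lp, pos in zip(all_logprobs, insert_positions)]
    return math.exp(torch.stack(nlls).mean().item())

def relative_reduction(aug_ppl, base_ppl):
    """Relative perplexity reduction versus the frozen baseline (as a fraction)."""
    return (base_ppl - aug_ppl) / base_ppl
```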
3.2. Public Benchmark Evaluation
We also test on a collection of well-known benchmarks:
- MMLU (5-shot)
- GSM8K (8-shot)
- DROP (3-shot F1)
- ARC-easy / ARC-challenge (0-shot)
- MATH (4-shot)
- Winogrande (0-shot)
- PIQA (0-shot)
- SIQA (0-shot)
- HellaSwag (0-shot)
- BoolQ (0-shot)
- MBPP (3-shot)
- AGIEval (3–5-shot)
- TriviaQA (5-shot)
- Natural Questions (NQ) (5-shot)
- HumanEval (pass@1)
- BBH (3-shot)
We insert a single batch of latent embeddings right after the prompt and before the model generates the answer. Table 2 summarizes the results with $\ell \in \{4, 8, 16, 32, 64\}$. The trend is consistent: on tasks requiring nontrivial reasoning, we see steady gains as $\ell$ increases. For instance:
- GSM8K jumps from 21.38% (baseline) to 31.43% with 64 embeddings (+10.05%).
- MMLU from 52.00% to 56.70% (+4.70%).
- ARC-challenge from 50.26% to 54.44% (+4.18%).
Some tasks see smaller or more moderate lifts, but importantly, the general picture is that learned latent augmentation helps across diverse tasks—without specialized tuning.
3.3. Comparison with Other Methods
3.3.1. Pause Token Baseline
In Pause Token (Goyal et al., 2023), the authors insert trainable embeddings between the input and output sequences to encourage “latent thinking.” However, those pause embeddings are not context-dependent; they do not read the kv-cache. Table 3 contrasts our approach with Pause Token using 32 embeddings. On the validation set, we measure perplexity for the first token after the embeddings:
- Baseline perplexity = 10.96
- Pause Token perplexity = 11.63
- Our method’s perplexity = 10.60
We also compare GSM8K 8-shot accuracy:
- Baseline = 21.38%
- Pause Token = 22.37%
- Ours = 26.76%
Hence, leveraging dynamic, input-conditioned embeddings via a coprocessor is significantly more potent than static embeddings.
3.3.2. Zero-Shot Chain-of-Thought (CoT)
Another strong baseline is zero-shot CoT (Kojima et al., 2022), which simply appends a phrase like “Let’s think step by step” to prompt the LLM to produce an intermediate rationale before the final answer. While beneficial, it triggers the LLM to generate tokens sequentially. In Table 4, for GSM8K 8-shot:
- Baseline = 21.38%
- Zero-shot CoT = 23.20%
- Ours (16 latent embeddings) = 24.72%
- Ours (32 latent embeddings) = 26.76%
We outperform zero-shot CoT by a noticeable margin, and we do so with a single forward pass of the coprocessor, avoiding the overhead of step-by-step textual expansions.
3.3.3. Alternative Coprocessor Configurations
- Training Coprocessor from Scratch: Instead of initializing $\phi$ with the LLM’s weights, we tried random initialization. Although it improves over the baseline, it underperforms our standard approach (which uses pretrained initialization). Figure 4 and Table 7 (in the Appendix) show that pretraining the coprocessor’s weights on the same distribution as the LLM fosters stronger synergy.
- LoRA Finetuning: We also tested LoRA (Hu et al., 2021) as a parameter-efficient approach; rather than fully finetuning the coprocessor, we keep its base weights frozen and train only low-rank adapters (see the sketch after this list). Table 5 demonstrates that while LoRA improves over the baseline (e.g., GSM8K from 21.38% to 24.03% with rank 128), it still falls short of full finetuning (26.76%). LoRA is, however, memory-efficient.
- Last Layer Activations vs. KV-Cache: One might feed the final-layer hidden states of the LLM into $\phi$ instead of the entire kv-cache. This is less effective: perplexity is higher and GSM8K accuracy is lower. We hypothesize that the kv-cache carries more comprehensive contextual signals from multiple layers.
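For reference, here is a minimal sketch of the kind of low-rank adapter used in the LoRA ablation. It is a generic hand-rolled LoRA layer under standard assumptions (zero-initialized up-projection, alpha/rank scaling), not the paper’s exact configuration; the wrapped layer and dimensions are hypothetical.

```python
import torch
from torch import nn

class LoRALinear(nn.Module):
    """Wraps a frozen linear layer with a trainable low-rank update W + B @ A."""
    def __init__(self, base: nn.Linear, rank: int = 128, alpha: float = 256.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)                    # frozen pretrained weight
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))
        self.scale = alpha / rank                      # standard LoRA scaling

    def forward(self, x):
        # The low-rank path starts at zero (B is zero-initialized), so training
        # begins from the frozen layer's behavior and only the adapter moves.
        return self.base(x) + self.scale * (x @ self.A.T) @ self.B.T

# Usage: wrap, e.g., a projection layer of the coprocessor instead of finetuning it.
layer = LoRALinear(nn.Linear(2048, 2048), rank=128)
```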
3.4. Impact of Number of Ahead Tokens
We vary $a \in \{4, 8, 16, 32\}$. While a larger $a$ can help with long-range predictions, it sometimes reduces accuracy on earlier tokens. Empirically, we find $a = 16$ to be a sweet spot, providing a well-balanced improvement, as shown in Table 6.
4. Related Work
4.1. Chain-of-Thought Reasoning
A vast literature addresses how to coax LLMs into unveiling or using intermediate steps. Chain-of-Thought (Wei et al., 2022) spurred interest in instructing the model to produce multi-step rationales. Subsequent research (Kojima et al., 2022; Zhou et al., 2023) has devised zero-shot or minimal-prompt expansions, while others propose verifying or searching over multiple reasoning paths (Wang et al., 2022; Lightman et al., 2023; Wang and Zhou, 2024; Yao et al., 2024). Despite their success, these methods typically generate discrete tokens, adding latency, and often do not easily integrate with standard backpropagation. Our method sidesteps discrete expansions by operating purely on continuous states, preserving end-to-end differentiability.
4.2. Latent Space Reasoning
An emerging line of inquiry examines LLMs’ hidden states for evidence of implicit reasoning. For instance, Biran et al. (2024) identify latent chains in middle layers that represent multi-hop thinking, and Shalev et al. (2024) reveal interpretable structure in mid-layer embeddings. Meanwhile, Pause Token (Goyal et al., 2023) and related works investigate “dummy” or “meta” tokens that nudge the model to do hidden reasoning, and COCONUT (Hao et al., 2024) encourages transformations within hidden states as a new reasoning framework. Our approach differs by dynamically generating these embeddings from the kv-cache, enabling truly input-conditioned latent expansions.
4.3. KV-Cache Compression
Methods like ICAE (Ge et al., 2024) or gist tokens (Mu et al., 2024) compress the kv-cache so that older memory can be stored more efficiently. Li et al. (2023) likewise remove redundant input. In contrast, we do not compress but augment the kv-cache. This is a key conceptual shift: rather than distilling existing information, we provide new latent embeddings that the frozen LLM can leverage.
4.4. Augmenting LLMs with External Modules
The broader field of parameter-efficient finetuning includes prefix tuning (Li and Liang, 2021), prompt tuning (Lester et al., 2021), adapters (Houlsby et al., 2019; Pfeiffer et al., 2023), LoRA (Hu et al., 2021), etc. These typically insert small modules or trainable prompts that the LLM attends to. Meanwhile, vision-language models like Flamingo (Alayrac et al., 2022) or PaLI (Chen et al., 2022) add external cross-attention for images. Our approach conceptually aligns with modular design but specializes an entire Transformer (the coprocessor) to transform the kv-cache itself—which is unusual, as typical adapters do not manipulate kv-cache states in such a direct manner.
4.5. Hypernetworks
A hypernetwork (Ha et al., 2017) learns to generate weights or modules conditioned on a context input. Several authors (Ponti et al., 2021; He et al., 2022; Ansell et al., 2021; Mahabadi et al., 2021) have used hypernetworks for adapter generation or continuous prompts. In that sense, our coprocessor is reminiscent of a hypernetwork: it reads the kv-cache (which encodes context) and produces new embeddings. But instead of generating neural weights, it directly emits latent tokens that are slotted back into the model’s memory.
5. Conclusion
We present differentiable cache augmentation, a new framework for extending a frozen LLM’s capabilities by introducing a learned coprocessor that operates in latent space. Rather than injecting discrete tokens or rewriting the LLM’s architecture, we insert a continuous embedding—synthesized from the kv-cache—back into the model’s memory. This design admits seamless end-to-end training using standard language modeling losses while maintaining the base LLM’s parameters in a fully frozen state. Experimental results show consistent perplexity reductions and significant improvements on a suite of challenging tasks. Critically, the coprocessor can be used asynchronously, allowing for more flexible—and potentially more powerful—forms of offline reasoning.
Our results only scratch the surface of what might be possible by rethinking LLM “deliberation” as an offline, latent-space process. Future directions are myriad:
- Scaling to bigger base LLMs and more robust coprocessors.
- Multi-Coprocessor setups, where different specialized modules handle arithmetic, code interpretation, or multi-lingual tasks.
- Task-specific training in a more direct manner (e.g., using few-shot or RL signals).
- Exploring how an LLM might iterate multiple times over its own kv-cache, refining reasoning footprints without ever producing explicit textual steps.
Ultimately, Deliberation in Latent Space may offer an appealing compromise between the interpretability of chain-of-thought approaches and the raw efficiency of next-token prediction—one that harnesses the best of both by allowing the model to “think” in a manner that is systematically optimizable, memory-friendly, and flexible.
References and Further Reading
- This work: "Deliberation in Latent Space via Differentiable Cache Augmentation" (2024). arXiv.
- Chain-of-Thought: Wei, J., Wang, X., Schuurmans, D., Bosma, M., Xia, F., Chi, E., et al. (2022). "Chain-of-thought prompting elicits reasoning in large language models." NeurIPS.
- Zero-shot CoT: Kojima, T., Gu, S. S., Reid, M., Matsuo, Y., & Iwasawa, Y. (2022). "Large language models are zero-shot reasoners." NeurIPS.
- Gemma-2: Team-Gemma, Riviere, M., Pathak, S., Sessa, P. G., Hardin, C., Bhupatiraju, S., et al. (2024). "Gemma 2: Improving open language models at a practical size." arXiv.
- KV-Cache Compression: Mu, J., Li, X., & Goodman, N. (2024). "Learning to compress prompts with gist tokens." NeurIPS.
- Pause Token: Goyal, S., Ji, Z., Rawat, A. S., Menon, A. K., Kumar, S., & Nagarajan, V. (2023). "Think before you speak: Training language models with pause tokens." arXiv.
- LoRA: Hu, E. J., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., Wang, L., & Chen, W. (2021). "LoRA: Low-rank adaptation of large language models." arXiv.
- Quiet-STaR: Zelikman, E., Harik, G., Shao, Y., Jayasiri, V., Haber, N., & Goodman, N. D. (2024). "Quiet-STaR: Language models can teach themselves to think before speaking." arXiv.
- COCONUT: Hao, S., Sukhbaatar, S., Su, D., Li, X., Hu, Z., Weston, J., & Tian, Y. (2024). "Training large language models to reason in a continuous latent space." arXiv.
- MATH: Hendrycks, D., Burns, C., Kadavath, S., Arora, A., Basart, S., Tang, E., Song, D., & Steinhardt, J. (2021). "Measuring mathematical problem solving with the MATH dataset." NeurIPS.
- Other cited works:
  - Schuster, T., Fisch, A., Gupta, J., et al. (2022). "Confident adaptive language modeling." NeurIPS.
  - Wang, X., Wei, J., Schuurmans, D., et al. (2022). "Self-consistency improves chain of thought reasoning in language models." arXiv:2203.11171.
  - Lightman, H., Kosaraju, V., Burda, Y., et al. (2023). "Let's verify step by step." arXiv:2305.20050.
  - Pfeiffer, J., Ruder, S., Vulić, I., & Ponti, E. M. (2023). "Modular deep learning." Transactions on Machine Learning Research.
  - Li, X., & Liang, P. (2021). "Prefix-tuning: Optimizing continuous prompts for generation." arXiv:2101.00190.
  - Lester, B., Al-Rfou, R., & Constant, N. (2021). "The power of scale for parameter-efficient prompt tuning." arXiv:2104.08691.