TL;DR
This paper introduces Titans, an architectural framework that combines short-term memory (via an attention mechanism) with a novel long-term neural memory to learn to memorize at test time, enabling better scalability and handling of extremely long contexts. The proposed memory module learns based on a “surprise” metric and can be trained in parallel via a carefully designed mini-batch gradient method. Experiments show that Titans surpass both standard Transformers and recent linear recurrent approaches across multiple tasks—demonstrating stronger recall, greater scalability, and better performance on reasoning and very long context tasks (e.g., context window larger than 2 million tokens).
Introduction
Overview and Motivation
In modern deep learning, the ability to handle long contexts while preserving crucial information has become a key challenge, especially for tasks such as language modeling, video understanding, and time-series forecasting. Attention-based architectures, such as standard Transformers (Vaswani et al. 2017), offer powerful in-context learning but incur a quadratic complexity in memory and compute. Linear recurrent methods scale more gracefully but struggle to capture and maintain crucial long-term signals. This work proposes a unifying view of short-term (attention-based) and long-term (neural-based) memory modules, integrated into a single architecture called Titans.
A central principle of this paper is that attention can be viewed as short-term memory—a mechanism that precisely retains local context—while a newly proposed long-term neural memory (NLM) acts as persistent storage for historical information. The authors design this memory to update itself at test time through a “surprise” metric, controlled by momentum and gating. Crucially, they also provide a strategy to train this memory module in parallel, avoiding naive sequential updates.
Key Ideas
- Memory Perspectives
- Attention as Short-Term Memory: Powerful but context-limited, since storing large windows leads to quadratic overhead.
- Neural Memory as Long-Term Storage: A parameterized network that updates itself online (during inference), enabling recall of crucial details over an arbitrarily large time span.
- Titans Architectures
- Combine a Core (attention with limited window) + Long-Term Neural Memory (trained on the fly) + Persistent Memory (task-specific, learnable parameters not tied to any single sample).
- Offer three variants for integrating memory (context integration, layer integration, or a gated branch).
- Surprise-Based Memory Updates
- Inspired by the psychological observation that unexpected or novel stimuli are strongly encoded in human memory.
- The memory’s parameters are updated via a gradient-based approach that views large gradients as signals of surprise.
- Fast, Parallelizable
- Even though the memory updates appear sequential, the paper’s mini-batch and chunking strategy shows how to make the updates occur in parallel using matrix multiplications.
- Empirical Results
- Titans outperform or match Transformers on large-scale tasks, including multi-million-token contexts.
- They also surpass linear recurrent networks and specialized SSMs in recall-heavy “needle-in-haystack” tasks.
Background and Preliminaries
Notation
- $x \in \mathbb{R}^{T \times d_{\text{in}}}$ denotes an input sequence (length $T$, embedding dimension $d_{\text{in}}$).
- $\mathbf{M}$ is the long-term memory module, potentially a multi-layer perceptron (MLP).
- $Q, K, V$ are the query, key, and value matrices in attention.
- $M$ (sometimes also used) refers to the attention mask or gating.
- $S^{(m)}$ indicates the $m$-th segment of a sequence.
Transformers and Attention
Quadratic Attention
Standard Transformers (Vaswani et al. 2017) compute:

$$Q = x\,W_Q,\quad K = x\,W_K,\quad V = x\,W_V,$$

$$y_t = \sum_{i=1}^{T} \frac{\exp\!\left(\frac{Q_t^\top K_i}{\sqrt{d_{\text{in}}}}\right)}{\sum_{\ell=1}^{T}\exp\!\left(\frac{Q_t^\top K_\ell}{\sqrt{d_{\text{in}}}}\right)}\, V_i.$$

The resulting $O(T^2)$ complexity quickly becomes prohibitive as $T$ grows.
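To make the cost concrete, here is a minimal NumPy sketch (not the paper’s code) of standard softmax attention; the explicit $T \times T$ score matrix is what drives the quadratic memory and compute.

```python
import numpy as np

def softmax_attention(x, W_q, W_k, W_v):
    """x: (T, d_in); W_q, W_k, W_v: (d_in, d_in) projection matrices."""
    Q, K, V = x @ W_q, x @ W_k, x @ W_v
    scores = Q @ K.T / np.sqrt(x.shape[-1])           # (T, T): quadratic in T
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)    # row-wise softmax
    return weights @ V                                # (T, d_in)
```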
Linear Attention
To improve efficiency, various “linear” or kernel-based attentions compute something like:

$$\mathrm{Attn}(Q, K, V)_t = \frac{\phi(Q_t)^\top \left(\sum_{i=1}^{T} \phi(K_i)\,V_i^\top\right)}{\phi(Q_t)^\top \left(\sum_{i=1}^{T} \phi(K_i)\right)},$$

where $\phi(\cdot)$ is a feature map. This allows partial factorization of the attention matrix, reducing cost. However, such models essentially compress past tokens into a single matrix or vector, risking memory overflow when $T$ is very large.
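For contrast, a NumPy sketch of the kernelized form above, with an assumed positive feature map $\phi$: the running sums $\sum_i \phi(K_i)V_i^\top$ and $\sum_i \phi(K_i)$ compress all past tokens into a fixed-size state, which is exactly the bottleneck discussed here.

```python
import numpy as np

def linear_attention(Q, K, V, phi=lambda z: np.maximum(z, 0.0) + 1.0):
    """Q, K, V: (T, d). phi is an assumed positive feature map (ReLU + 1 here)."""
    T, d = Q.shape
    S = np.zeros((d, d))          # running sum of phi(K_i) V_i^T
    z = np.zeros(d)               # running sum of phi(K_i)
    out = np.zeros_like(V, dtype=float)
    for t in range(T):            # causal variant: only tokens up to t enter the state
        S += np.outer(phi(K[t]), V[t])
        z += phi(K[t])
        out[t] = phi(Q[t]) @ S / (phi(Q[t]) @ z + 1e-6)
    return out
```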
Modern Linear Recurrent Networks
Standard RNNs store information in a hidden state $\mathbf{h}_t$. In a memory view:

$$\mathbf{M}_t = \mathrm{Write}(\mathbf{M}_{t-1}, x_t), \qquad y_t = \mathrm{Read}(\mathbf{M}_t, x_t).$$

Recent efforts improve the “Write” step with gating or delta rules. Yet storing extremely large contexts in a single $\mathbf{M}$ remains problematic; it becomes a bottleneck as $T \to \infty$.
Memory as a Driving Principle
The authors suggest that attention (short-term memory) and recurrent approaches (single hidden state) can be complementary. A complete system can house both a short-term mechanism for local context and a parametric memory for truly long-horizon needs. This framing sets the stage for Titans.
Long-Term Neural Memory: Learning to Memorize at Test Time
Surprise-Based Updates
The proposed neural memory $\mathbf{M}$ is updated after each token using a gradient-based rule:

$$\mathbf{M}_t = \mathbf{M}_{t-1} - \eta_t\,\nabla \ell(\mathbf{M}_{t-1}; x_t),$$

where $\ell$ is an associative memory objective:

$$\ell\bigl(\mathbf{M}_{t-1}; x_t\bigr) = \bigl\|\mathbf{M}_{t-1}(k_t) - v_t\bigr\|_2^2,$$

and $k_t$ and $v_t$ are the key and value of the current token. Large gradients indicate a mismatch between stored knowledge and the new input, implying surprise.
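As a concrete illustration, here is a minimal sketch of this update for the special case of a linear matrix memory (the paper’s memory is a deeper network, for which the gradient would come from autodiff rather than the closed form below).

```python
import numpy as np

def surprise_step(M, k, v, eta):
    """One test-time update of a linear memory M: (d, d), with key/value k, v: (d,)."""
    err = M @ k - v                    # prediction error of the stored association
    grad = 2.0 * np.outer(err, k)      # closed-form gradient of ||M k - v||_2^2 w.r.t. M
    surprise = np.linalg.norm(grad)    # large gradient norm = surprising token
    return M - eta * grad, surprise
```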
Momentum and Forgetting Gates
- Momentum: Incorporate past “surprise” over multiple tokens: $\Delta_t = \alpha_t\,\Delta_{t-1} - \beta_t\,\nabla \ell(\mathbf{M}_{t-1}; x_t)$, then $\mathbf{M}_t = \mathbf{M}_{t-1} + \Delta_t$.
- Forgetting Gate: Use a gating variable $\gamma_t \in [0, 1]$: $\mathbf{M}_t = (1 - \gamma_t)\,\mathbf{M}_{t-1} + \gamma_t\,\Delta_t$.

If $\gamma_t \approx 1$, most old memory is overwritten. If $\gamma_t \approx 0$, old memory is largely retained. This mechanism is crucial when the sequence has abrupt changes or when capacity is finite.
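Extending the same linear-memory sketch with the momentum and forgetting-gate equations above (again an illustrative simplification, not the released implementation):

```python
import numpy as np

def titan_memory_step(M, delta, k, v, alpha, beta, gamma):
    """alpha: momentum decay, beta: surprise scale, gamma in [0, 1]: forgetting gate."""
    grad = 2.0 * np.outer(M @ k - v, k)       # gradient of ||M k - v||_2^2
    delta = alpha * delta - beta * grad       # momentum accumulates past surprise
    M = (1.0 - gamma) * M + gamma * delta     # gamma near 1 overwrites, near 0 retains
    return M, delta
```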
Deeper Memory Architectures
While one could let $\mathbf{M}$ be a single matrix (or vector), the paper argues that a multi-layer MLP (or more advanced network) can store more complex relationships. During inference/read, a token’s query $q_t$ is simply passed through $\mathbf{M}$:

$$\hat{v}_t = \mathbf{M}^*(q_t),$$

where $\mathbf{M}^*$ denotes the forward pass of the updated memory, without further changes to parameters.
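Reading from a deeper memory is just a forward pass through the adapted weights; a two-layer sketch with hypothetical parameter names `W1`, `W2`:

```python
import numpy as np

def read_memory(q, W1, W2):
    """q: (d,) query; W1: (d, h), W2: (h, d). Retrieval only, no parameter update."""
    h = np.maximum(q @ W1, 0.0)   # hidden layer (ReLU assumed for illustration)
    return h @ W2                 # retrieved value v_hat
```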
Parallelizable Training with Mini-Batches
Though these updates appear sequential, the authors show a chunked mini-batch scheme can compute them in parallel. Summations of gradients and gating products can be handled with matrix multiplications. This is key to scalability on GPU/TPU platforms.
Titans: Integrating Long-Term and Short-Term Memories
Titans is a blueprint that unifies:
- Core (short-term): A limited-window attention.
- Long-Term Neural Memory: The module described above, updated online.
- Persistent Memory: A set of global, task-specific parameters that do not change with every token (e.g., a learned knowledge base).
Three Variants
- Context Integration
- The retrieved memory $\hat{v}_t$ is concatenated or merged into the token embeddings (keys/values) for short-term attention.
- Layer Integration
- Each Titan block has a standard attention sub-layer and a memory sub-layer, potentially in series or parallel.
- Gated Branch
- The output from the attention block is combined with the memory retrieval via a trainable gate, e.g. $z_t = \sigma(\theta)\,\mathrm{Attn}_t + (1 - \sigma(\theta))\,\hat{v}_t$ (a minimal sketch follows below).
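A sketch of the gated-branch combination, where `theta` stands in for a learned gating parameter:

```python
import numpy as np

def gated_combine(attn_t, v_hat_t, theta):
    """Blend attention output and memory retrieval with a learned scalar gate."""
    g = 1.0 / (1.0 + np.exp(-theta))          # sigmoid gate in (0, 1)
    return g * attn_t + (1.0 - g) * v_hat_t
```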
Why It Works
- Efficiency: The short-term attention window is kept manageable, avoiding $O(T^2)$ blowups.
- Recall: The memory can hold patterns from across the entire sequence.
- Dynamic Control: Gating (or layering) ensures the model can shift emphasis between fresh local context and distant historical information.
Experimental Validation
1. Language Modeling
- Setup: Standard corpora plus “needle-in-haystack” variations (where the relevant context is thousands of tokens away).
- Results: Titans consistently outperform or match full-context Transformers, while using less memory for the attention window. They also beat linear recurrent models that rely on a single hidden state/matrix.
2. Commonsense Reasoning
- Tasks: Reasoning benchmarks requiring multi-hop inference.
- Findings: Titans yield better generalization, presumably due to their ability to keep “facts” or “context” in the neural memory across multiple steps.
3. Time-Series Forecasting
- Challenge: Forecasting tasks often have extremely long horizons. Standard Transformers cannot handle huge $T$.
- Outcome: Titans store relevant signals in the memory, gating out unhelpful patterns. This outperforms specialized approaches that rely on large gating or hidden states alone.
4. DNA/Genomics Modeling
- Context: Genome sequences span millions of tokens, so typical attention windows are impossible to handle fully.
- Observation: The memory gating triggers updates for surprising or novel DNA patterns, ignoring repetitive sequences. Titans thus handle million-token windows more gracefully than baselines.
Qualitative Insights
- Surprise: High updates occur for rare tokens or abrupt topic changes.
- Scalability: Titans can scale beyond 2M tokens by combining small-window attention with the neural memory.
- Generalization: The memory’s meta-learning approach avoids simple overfitting to local patterns; it can adapt at inference time.
Broader Connections and Future Directions
Human Cognition Analogy
Dual-memory systems (short-term vs. long-term) mirror well-known models in cognitive science. Humans prioritize encoding of surprising experiences.
Meta-Learning
Updating weights at test time is reminiscent of meta-learning or “fast weights”. The new approach leverages advanced gating and parallelization for large-scale tasks.
Practical Considerations
- Hardware Efficiency: Matrix-based parallel updates keep training feasible for massive datasets.
- Hyperparameters: Tuning momentum and gating thresholds is essential.
- Privacy & Memorization: Storing data at test time raises potential privacy issues. Mechanisms to forget or encrypt sensitive data may be necessary.
Toward Richer Memory Modules
Though Titans employ MLPs as memory, more sophisticated architectures could further enhance recall capacity and accuracy.
Detailed Summary of Each Section
In the following expanded discussion, we emphasize the distinctiveness and the rationale behind each design choice, ensuring that every piece of the paper’s argument is fleshed out thoroughly.
1. The Challenge of Extremely Long Contexts
Modern applications require handling input lengths in the millions. Transformers, while powerful, face:
- $O(T^2)$ scaling in compute and memory.
- Potential GPU/TPU memory exhaustion.
Linearized Transformers compress all past tokens into a single state or matrix, risking loss of important details. Titans aim to fix this by pairing short-term local attention with a parametric memory for the broader context.
2. Psychological Underpinnings
From psychology, we know short-term memory (working memory) is limited but accurate over a small timescale, while long-term memory is expansive but updated more slowly. The idea of weighting “surprising” experiences more heavily aligns with real-world cognitive processes.
3. Associative Memory Loss
Using $\|\mathbf{M}(k_t) - v_t\|^2$ to train $\mathbf{M}$ is reminiscent of classical fast-weight approaches that store key-value associations directly in parameters. Instead of training $\mathbf{M}$ purely on pre-collected data, Titans update it live, at inference time, reflecting genuine adaptivity.
4. Gradient-Based Surprise Metric
Gradient magnitude is a direct reflection of mismatch between model prediction and target. The authors argue this is a simple yet effective measure of novelty or surprise. Additionally, momentum extends the effect of a surprising event beyond a single time step, paralleling the idea of lingering focus or “attention residue.”
5. Data-Dependent Forgetting
When context shifts drastically, old information can become irrelevant or misleading. The gating function $\gamma_t$ ensures that the memory can selectively retain or discard prior knowledge. Without forgetting, the amount of information the memory must absorb grows without bound, eventually leading to overwriting or overflow.
6. Multi-Layer vs. Single-Layer Memory
A linear memory can store only linear transformations of tokens, limiting expressiveness. By contrast, a multi-layer network can encode more nuanced functions. Empirical results confirm multi-layer memory consistently outperforms single-layer designs in tasks requiring advanced pattern recognition.
7. Parallelizing Memory Updates
Naive implementations would require a loop over tokens, each time performing a gradient step. This is slow and less hardware-friendly. The chunking strategy groups tokens into mini-batches, computing combined gradients that replicate sequential updates—but using large matrix multiplications.
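A simplified sketch of the idea for a linear memory: if, within a chunk, gradients are taken with respect to the memory state at the chunk start (and the momentum and gating factors the paper also parallelizes are omitted here), all per-token gradients of the chunk collapse into two matrix multiplications.

```python
import numpy as np

def chunk_gradient(M0, K, V):
    """M0: (d, d) memory at chunk start; K, V: (b, d) keys/values of one chunk."""
    err = K @ M0.T - V            # (b, d): all per-token prediction errors in one matmul
    return 2.0 * err.T @ K        # (d, d): sum over the chunk of 2 * err_t k_t^T

def chunked_update(M, keys, values, eta, chunk_size=64):
    """Replace the token-by-token loop with one gradient step per chunk."""
    for start in range(0, len(keys), chunk_size):
        K = keys[start:start + chunk_size]
        V = values[start:start + chunk_size]
        M = M - eta * chunk_gradient(M, K, V)
    return M
```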
8. Titans Architecture
Core: Standard self-attention with a limited context window of width $w$, ensuring feasible $O(T\,w)$ complexity.
Long-Term Neural Memory: A deep MLP that keeps track of surprising data across the sequence.
Persistent Memory: Encodes fixed knowledge about the task or domain (like a learned “knowledge base”).
9. Empirical Evidence
- Language Modeling: Better perplexity on standard corpora and specialized tasks with extremely long-distance dependencies.
- Commonsense Reasoning: Gains in multi-hop tasks, implying that the neural memory can store relevant context or facts.
- Time-Series: Effective at capturing seasonal or multi-scale patterns that appear sporadically in long sequences.
- Genomics: Successfully learns to ignore repetitive sequences while focusing on novel mutations or patterns.
10. Ablations
- Removing momentum or gating degrades performance.
- Decreasing the depth of the memory module reduces recall capacity.
- Using purely linear memory for extremely large contexts leads to “memory saturation.”
11. Limitations
- Large memory networks can add parameters. There’s a trade-off between memory capacity and parameter count.
- Extreme domain shifts may require more dynamic gating or forgetting schedules.
- Additional strategies might be needed to handle the privacy concerns of memorizing data at inference time.
12. Future Directions
- The community might explore combining Titans with advanced memory modules from other domains.
- The gating mechanism can be learned in more flexible ways (e.g., reinforcement learning or discrete gates).
- The system could interface with external knowledge bases, bridging neural memory with symbolic data structures.
Conclusion
Titans seamlessly fuse short-term attention (for immediate context) and a novel long-term memory module (for persistent, large-scale recall). By learning to memorize at test time, Titans maintain high recall capacity over multi-million-token contexts, all while preserving efficient local operations. Key to this innovation is a surprise-based training rule plus gating and momentum, enabling the memory to store and retrieve crucial data with minimal overhead.
Extensive experiments on language modeling, commonsense reasoning, time-series forecasting, and genomics confirm the approach’s effectiveness. The proposed framework can readily scale to windows beyond 2 million tokens, underscoring the synergy between a short-term attention window and a robust, online-updatable memory.
In bridging local vs. global information, short vs. long-term storage, and feed-forward vs. recurrent paradigms, Titans hint at a path forward for next-generation sequence models—where memory is no longer a singular concept but a carefully orchestrated ensemble of modules.
Sources
- Attention Is All You Need (Vaswani et al. 2017)
- Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention (Katharopoulos et al. 2020)
- Scaling Laws for Neural Language Models (Kaplan et al. 2020)
- Informer: Beyond Efficient Transformer for Long Sequence Time-Series Forecasting (H. Zhou et al. 2021)
- Multilayer feedforward networks are universal approximators (Hornik, Stinchcombe, and White 1989)
- Learning to Control Fast-Weight Memories (Schmidhuber 1992)
- Widrow and Hoff’s Delta Rule (1988)
- Yu Sun et al. 2024 (Tensorizing mini-batch gradient descent)
- Titans: Learning to Memorize at Test Time (Ali Behrouz et al. 2024)