The artificial intelligence community just witnessed something extraordinary—and profoundly counterintuitive. A neural network with merely 7 million parameters has achieved what billion-parameter behemoths couldn’t: genuine abstract reasoning on some of the hardest puzzles designed to test machine intelligence. This isn’t incremental progress. It’s a paradigm shift that challenges everything we thought we knew about the relationship between model size and capability.
The paper “Less is More: Recursive Reasoning with Tiny Networks” by Alexia Jolicoeur-Martineau from Samsung SAIL Montréal introduces the Tiny Recursive Model (TRM), an architecture so elegantly simple it almost seems absurd. Yet this diminutive system achieves 44.6% accuracy on ARC-AGI-1 and 7.8% on ARC-AGI-2—benchmarks specifically designed to resist the brute-force memorization tactics that have propelled large language models to dominance. To put this in perspective: TRM outperforms Deepseek R1, o3-mini, and Gemini 2.5 Pro while using less than 0.01% of their parameters.
How? By recursing. By thinking iteratively. By doing what our own minds do when confronting genuinely novel problems: refining, reconsidering, and progressively improving our understanding through repeated passes over the same information.
The Hierarchical Reasoning Model: Biological Inspiration Meets Computational Reality
To understand TRM’s breakthrough, we must first examine its predecessor: the Hierarchical Reasoning Model (HRM), developed by researchers at Sapient Intelligence in Singapore. HRM represented a radical departure from the chain-of-thought (CoT) reasoning that dominates contemporary large language models.
Chain-of-thought prompting—the technique that powers systems like GPT-4 and Claude—forces models to externalize their reasoning process into sequential text tokens. “First, I’ll do this. Then, I’ll do that.” It’s reasoning as performance, reasoning as narration. And while it works, it’s fundamentally limited. A single incorrect token can derail the entire reasoning chain. The approach requires massive amounts of high-quality reasoning data. And it’s expensive—generating thousands of intermediate tokens for complex problems creates substantial latency and computational cost.
HRM took inspiration from neuroscience, specifically from how the human brain organizes computation hierarchically across cortical regions operating at different timescales. As TechTalks explains, “The brain sustains lengthy, coherent chains of reasoning with remarkable efficiency in a latent space, without constant translation back to language.”
The architecture featured two coupled recurrent modules: a high-level module for slow, abstract planning, and a low-level module for rapid, detailed computations. These modules operated through “hierarchical convergence”—the fast module would explore part of the problem space, settle on an intermediate solution, then the slow module would reflect, update the overall strategy, and reset the fast module with new direction.
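To make that control flow concrete, here is a toy PyTorch sketch of the coupling. The module names, GRU cells, dimensions, and step counts are illustrative stand-ins, not the real HRM, which uses transformer blocks and its own timing constants:

```python
import torch
import torch.nn as nn

class HRMSketch(nn.Module):
    """Toy illustration of HRM's two coupled recurrent modules (not the actual architecture)."""
    def __init__(self, dim=128):
        super().__init__()
        self.fast = nn.GRUCell(dim, dim)   # low-level: rapid, detailed computation
        self.slow = nn.GRUCell(dim, dim)   # high-level: slow, abstract planning

    def forward(self, x, n_cycles=2, fast_steps=2):
        z_fast = torch.zeros_like(x)
        z_slow = torch.zeros_like(x)
        for _ in range(n_cycles):
            for _ in range(fast_steps):            # fast module explores within the cycle
                z_fast = self.fast(x + z_slow, z_fast)
            z_slow = self.slow(z_fast, z_slow)     # slow module reflects and redirects
        return z_slow
```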

With just 27 million parameters trained on approximately 1,000 examples, HRM achieved remarkable results. On Sudoku-Extreme puzzles, it reached 55% accuracy. On Maze-Hard pathfinding challenges, 74.5%. On ARC-AGI-1, 40.3%. These were problems where state-of-the-art chain-of-thought models achieved 0% accuracy—complete failure.
But HRM was complex. It relied on intricate biological arguments about temporal frequencies in brain oscillations. It invoked the Implicit Function Theorem and one-step gradient approximations to justify only backpropagating through a subset of its recursions. It required two separate networks operating at different hierarchical levels. And crucially, an independent analysis by the ARC Prize Foundation revealed something telling: deep supervision (improving answers through multiple supervision steps) was the primary driver of performance gains, while the recursive hierarchical reasoning itself provided only marginal improvements.
Simplification as Innovation: The Birth of TRM
Jolicoeur-Martineau saw opportunity in this complexity. What if the biological arguments were unnecessary? What if the fixed-point theorems were mathematical overkill? What if you could achieve better results with a radically simpler approach?
TRM strips away the theoretical scaffolding. No hierarchical interpretation. No dual networks. No appeals to neural oscillations or cortical timescales. Instead, it embraces a beautifully straightforward concept: recursive refinement of a latent reasoning state.
The model maintains three components:
- x: the input question (embedded)
- y: the current predicted answer (embedded)
- z: a latent reasoning feature
At each step, TRM recursively updates z given x, y, and the previous z. This is the “thinking” phase—the model is improving its internal understanding. Then it updates y given the current z and previous y. This is the “answering” phase—translating improved understanding into a refined solution.
This process repeats up to 16 times. Each iteration is an opportunity to catch errors, reconsider assumptions, and progressively converge toward the correct answer. It’s parameter-efficient because the same tiny network is reused recursively rather than requiring massive depth. It minimizes overfitting because the model learns to improve any intermediate state, not just memorize specific patterns.
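The loop is short enough to sketch in a few lines. Below is a minimal PyTorch illustration of the think-then-answer recursion; the network shape, dimensions, and exact inputs to each update are simplifying assumptions rather than the paper's implementation:

```python
import torch
import torch.nn as nn

class TRMSketch(nn.Module):
    """Toy illustration of TRM's recursive refinement with one shared tiny network."""
    def __init__(self, dim=128):
        super().__init__()
        # A single small network, reused for every recursive step.
        self.net = nn.Sequential(
            nn.Linear(3 * dim, dim), nn.GELU(), nn.Linear(dim, dim))

    def forward(self, x, y, z, n=6):
        for _ in range(n):
            # "thinking": refine the latent z given the question, current answer, and old z
            z = self.net(torch.cat([x, y, z], dim=-1))
        # "answering": refine y from the improved z (the paper omits x for this update;
        # we keep the same input layout so one network serves both phases)
        y = self.net(torch.cat([x, y, z], dim=-1))
        return y, z
```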
The elegance is in the reinterpretation. HRM’s two latent features (z_L and z_H) were justified through complex hierarchical arguments. TRM recognizes something simpler: z_H is just the current solution, while z_L is latent reasoning that doesn’t directly correspond to a solution but can be transformed into one. Rename them y and z, and suddenly the architecture makes intuitive sense without any biological handwaving.
Less is Literally More: The Two-Layer Revolution
Perhaps the most shocking finding: smaller networks generalize better.
Jolicoeur-Martineau experimented with increasing model capacity by adding layers. Conventional wisdom suggests this should improve performance—deeper networks have greater representational power. But on these small-data reasoning tasks, the opposite occurred. Adding layers decreased generalization due to overfitting.
So she went the other direction. She reduced the network from 4 layers to just 2, while proportionally increasing the number of recursive steps to maintain roughly equivalent computational depth. The result? Massive improvement. On Sudoku-Extreme, accuracy jumped from 79.5% to 87.4% while simultaneously cutting parameters in half.
Two layers. That’s it. A network so tiny it seems almost toy-like. Yet it achieves what models 10,000 times larger cannot.
This finding echoes recent work on deep equilibrium diffusion models, which also found optimal performance with 2-layer architectures. But while those models showed similar performance to larger variants, TRM actually outperforms bigger networks. This suggests something profound about the relationship between model capacity and data scarcity. When training data is limited, architectural efficiency matters more than raw parameter count.
Attention-Free Architecture: When MLPs Beat Transformers
For tasks with small, fixed context lengths, TRM makes another counterintuitive choice: it replaces self-attention with simple multilayer perceptrons (MLPs) applied across the sequence dimension.
Self-attention is the crown jewel of modern AI—the mechanism that enabled transformers to revolutionize natural language processing. But it’s designed for scenarios where sequence length vastly exceeds embedding dimension. For a 9×9 Sudoku grid, this advantage disappears. An MLP operating on the sequence is cheaper and, crucially, more effective.
On Sudoku-Extreme, removing self-attention and using pure MLPs improved accuracy from 74.7% to 87.4%. That’s a 12.7 percentage point gain from removing the supposedly essential component of modern AI architecture.
Of course, this doesn’t generalize to all tasks. For problems with large context lengths like 30×30 mazes or ARC-AGI puzzles, self-attention remains beneficial. But it demonstrates an important principle: architectural choices should match problem structure, not blindly follow prevailing trends.
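For readers wondering what “an MLP across the sequence dimension” looks like in practice, here is a minimal sketch. It assumes a fixed 81-token sequence (a flattened 9×9 Sudoku grid); the hidden size is an arbitrary choice:

```python
import torch
import torch.nn as nn

class TokenMixMLP(nn.Module):
    """Mixes information across positions with a plain MLP instead of self-attention.
    Only viable because the sequence length is small and fixed (81 cells of a 9x9 grid)."""
    def __init__(self, seq_len=81, hidden=256):
        super().__init__()
        self.mix = nn.Sequential(
            nn.Linear(seq_len, hidden), nn.GELU(), nn.Linear(hidden, seq_len))

    def forward(self, x):                  # x: (batch, seq_len, dim)
        x = x.transpose(1, 2)              # (batch, dim, seq_len): the MLP now spans positions
        x = self.mix(x)
        return x.transpose(1, 2)           # restore (batch, seq_len, dim)
```

Because the linear layers are sized to the sequence length, this only works when that length is known in advance, which is exactly why attention stays useful for variable or large contexts.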
The Adaptive Computational Time Dilemma
HRM incorporated Adaptive Computational Time (ACT)—a mechanism to determine when to stop iterating on a problem and move to the next training example. Without ACT, the model would spend all 16 supervision steps on every example, which is inefficient when some problems are solved quickly.
HRM implemented ACT through Q-learning, with separate “halting” and “continue” losses. But here’s the catch: the continue loss required an additional forward pass through the entire network. So while ACT optimized time per sample, it doubled the computational cost per optimization step.
TRM simplifies this dramatically. Instead of Q-learning with two losses, it uses a single Binary Cross-Entropy loss predicting whether the current solution is correct. No second forward pass needed. The model learns to halt when it’s confident it has the right answer.
This change had minimal impact on accuracy (86.1% vs 87.4%) but cut training time substantially. Sometimes the simplest solution is the best solution.
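A hedged sketch of what such a halting objective could look like, assuming the model exposes one scalar halting logit per example alongside token-level predictions:

```python
import torch
import torch.nn.functional as F

def halting_loss(halt_logits, y_pred, y_true):
    """Single BCE halting objective: the target is 1 exactly when the current
    predicted answer matches the ground truth at every position."""
    target = (y_pred == y_true).all(dim=-1).float()   # (batch,) of 0.0 / 1.0
    return F.binary_cross_entropy_with_logits(halt_logits, target)
```

One loss, one forward pass: the model simply learns to estimate “am I done?” from its own state.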
Exponential Moving Average: Stability Through Smoothing
On small datasets like Sudoku-Extreme and Maze-Hard, HRM tended to overfit quickly and then diverge—a common problem when models have high capacity relative to data size. TRM addresses this with Exponential Moving Average (EMA) of the weights, a technique borrowed from GANs and diffusion models.
EMA maintains a smoothed version of the model parameters by averaging recent updates. This prevents sharp changes that can destabilize training and improves generalization. On Sudoku-Extreme, adding EMA improved accuracy from 79.9% to 87.4%—a substantial gain from a simple stabilization technique.
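EMA itself is only a few lines. Here is a generic version of the update with a typical decay value; the paper's exact constant and where it sits in the training loop may differ:

```python
import copy
import torch

@torch.no_grad()
def ema_update(ema_model, model, decay=0.999):
    """Fold the latest weights into a smoothed shadow copy after each optimizer step."""
    for ema_p, p in zip(ema_model.parameters(), model.parameters()):
        ema_p.mul_(decay).add_(p, alpha=1.0 - decay)

# ema_model = copy.deepcopy(model)   # created once, before training starts
# after every optimizer.step():  ema_update(ema_model, model)
# evaluation then uses ema_model, whose weights drift slowly and smoothly
```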
The Optimal Recursion Sweet Spot
How many recursive steps are optimal? TRM experiments with different values of T (number of deep recursion cycles) and n (number of latent reasoning steps per cycle).
For Sudoku-Extreme, T=3 and n=6 proved optimal, which emulates an effective depth of T × (n+1) passes through the 2-layer network, i.e. 3 × 7 × 2 = 42 layers. More recursions could potentially help on harder problems, but they incur massive computational slowdowns. Fewer recursions leave performance on the table.
Interestingly, TRM requires backpropagation through the full recursion process, which increases memory requirements compared to HRM’s one-step approximation. But the memory cost is well worth paying: the performance gains are substantial enough to justify the additional resources.
Benchmark Domination: When Tiny Beats Titanic
The results speak for themselves:
Sudoku-Extreme (1,000 training examples, 423,000 test examples):
- HRM: 55.0%
- TRM (MLP): 87.4%
- Deepseek R1, Claude 3.7, o3-mini-high: 0.0%
Maze-Hard (30×30 mazes, shortest path >110 steps):
- HRM: 74.5%
- TRM (with attention): 85.3%
- Deepseek R1, Claude 3.7, o3-mini-high: 0.0%
ARC-AGI-1 (geometric puzzles testing fluid intelligence):
- HRM: 40.3%
- TRM (with attention): 44.6%
- Deepseek R1: 15.8%
- Claude 3.7 16K: 28.6%
- o3-mini-high: 34.5%
- Gemini 2.5 Pro 32K: 37.0%
ARC-AGI-2 (even harder variant released in 2025):
- HRM: 5.0%
- TRM (with attention): 7.8%
- Deepseek R1: 1.3%
- Claude 3.7 16K: 0.7%
- o3-mini-high: 0.0%
- Gemini 2.5 Pro 32K: 4.9%
These aren’t marginal improvements. On Sudoku-Extreme, TRM achieves 87.4% while the best LLMs achieve literally zero. On ARC-AGI-2, TRM with 7 million parameters outperforms Gemini 2.5 Pro, which likely has hundreds of billions of parameters.
The ARC-AGI benchmark is particularly significant because it was specifically designed to resist memorization and test genuine abstract reasoning—the kind of fluid intelligence that characterizes human cognition. As François Chollet, creator of ARC-AGI, explains: “Intelligence is measured by the efficiency of skill-acquisition on unknown tasks. Simply, how quickly can you learn new skills?”
The Scaling Paradox: Why Bigger Isn’t Always Better
Modern AI has been dominated by a simple narrative: scale is all you need. More parameters, more data, more compute. The success of GPT-3, GPT-4, and other large language models seemed to validate this approach.
But TRM reveals a different story. On reasoning tasks with limited data, architectural efficiency trumps raw scale. A 7-million-parameter model with the right recursive structure outperforms 671-billion-parameter models.
This isn’t just an academic curiosity. It has profound implications for the future of AI:
Efficiency: TRM can run on consumer hardware. You don’t need massive GPU clusters or cloud infrastructure.
Interpretability: Smaller models are easier to understand, debug, and audit.
Environmental impact: Training and running 7M-parameter models requires orders of magnitude less energy than billion-parameter alternatives.
Accessibility: Researchers and developers without access to massive computational resources can still work on frontier AI problems.
Generalization: The results suggest that for certain problem classes, the path to better performance isn’t more scale—it’s better algorithms.
The Deep Supervision Insight
TRM’s training hinges on deep supervision, the ingredient the ARC Prize Foundation’s analysis identified as HRM’s real workhorse: previous latent features are reused as the initialization for the next forward pass. This allows the model to reason over many iterations, progressively refining its solution until it converges to the correct answer.
The model is trained to take any intermediate state (y, z) and improve it. This means that even without gradients, running several recursion processes is expected to bring the model closer to the solution. TRM runs T-1 recursion processes without gradients to improve the latent state, then one final recursion with backpropagation.
This approach completely bypasses the need for fixed-point theorems or one-step gradient approximations. It’s conceptually simpler and empirically more effective.
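Putting the pieces together, one supervision step might look like the following sketch. It assumes a TRM-style model (like the earlier TRMSketch) whose output y holds per-position logits; the loss choice and tensor shapes are illustrative:

```python
import torch
import torch.nn.functional as F

def supervision_step(model, x, y, z, y_true, T=3, n=6):
    """One deep-supervision step: improve (y, z) through T-1 recursion processes
    without gradients, then backpropagate through a single final recursion process."""
    with torch.no_grad():
        for _ in range(T - 1):
            y, z = model(x, y, z, n=n)     # free refinement; no graph is kept
    y, z = model(x, y, z, n=n)             # gradients flow through this pass only
    loss = F.cross_entropy(y.reshape(-1, y.size(-1)), y_true.reshape(-1))
    loss.backward()
    # detach so the next supervision step starts from the improved state
    return y.detach(), z.detach(), loss.item()
```

The no-grad recursions work precisely because the model is trained to improve any intermediate (y, z), so extra passes cost memory for activations only during the final, backpropagated one.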
What Didn’t Work: Failed Ideas Worth Knowing
The paper includes a valuable section on failed experiments—ideas that seemed promising but didn’t pan out:
Mixture-of-Experts (MoE): Replacing standard MLPs with MoE layers massively decreased generalization. MoEs add unnecessary capacity that leads to overfitting on small datasets.
Partial gradient backpropagation: Instead of backpropagating through all n+1 recursions, the authors tried backpropagating through only the last k recursions. This didn’t help and added complexity.
Removing ACT entirely: Without any early stopping mechanism, the model spent too much time on the same examples rather than learning from diverse data, significantly hurting generalization.
Weight tying: Tying the input embedding and output head was too constraining and caused a massive performance drop.
Deep Equilibrium Models: Using fixed-point iteration (as in Deep Equilibrium Models) slowed training and reduced generalization, highlighting that converging to a fixed point isn’t essential.
These negative results are as important as the positive ones—they help future researchers avoid dead ends and understand the design space.
Limitations and Future Directions
TRM isn’t a universal solution. The attention-free MLP architecture works brilliantly for small, fixed-context tasks like 9×9 Sudoku but fails on larger grids. Different problem settings require different architectural choices.
The paper also notes that while recursive reasoning dramatically improves performance compared to larger, deeper networks, the why remains somewhat mysterious. The authors suspect it relates to overfitting—recursive architectures with deep supervision may provide a form of implicit regularization—but they lack a formal theory.
Additionally, TRM is currently a supervised learning method, not a generative model. It produces a single deterministic answer given an input. Many real-world problems have multiple valid solutions, so extending TRM to generative tasks would be valuable.
The Broader Implications: Rethinking AI’s Future
This work arrives at a critical moment. The AI field has been increasingly dominated by a “bigger is better” mentality. Companies compete to build ever-larger models, consuming ever-more resources. The environmental and economic costs are staggering.
TRM suggests an alternative path. What if the next breakthrough in AI isn’t about scale but about architecture? What if we’ve been solving the wrong optimization problem—maximizing parameters instead of maximizing reasoning efficiency?
The success of tiny recursive models on ARC-AGI—a benchmark explicitly designed to measure progress toward artificial general intelligence—is particularly significant. It suggests that genuine intelligence might not require massive scale. It might require the right computational structure: recursive refinement, progressive improvement, and the ability to learn from minimal examples.