The Labyrinth of Computation: Inference Scaling, Self-Play, and AI Safety

Inference scaling has emerged as a novel and startling paradigm that stretches the boundaries of conventional AI development. In Ryan Kidd’s recent article, Implications of the Inference Scaling Paradigm for AI Safety (2025), he posits that allocating colossal amounts of compute at test time—rather than merely at training—can dramatically boost a model’s performance. Buoyed by OpenAI’s launch of o1 and the benchmark results of 03 (not yet independently verified), Kidd suggests that this emergent approach could significantly shift how researchers and engineers conceive of timelines, security protocols, and interpretability. Echoing the metamorphoses spurred by chain-of-thought (CoT) techniques, the new wave of inference scaling may furnish staggering capabilities, albeit with weighty concerns regarding costs, oversight, and potential misappropriation. Moreover, an extensive comment from Gwern accentuates a complementary dimension: the self-play or search-based scaling pathway, wherein the AI’s monstrous inference cycles could themselves yield training data for subsequent, more refined models—just as AlphaGo’s leaps were inextricably linked to self-generated game data. Below is a synthesis traversing the crux of Kidd’s article, his AI safety reflections, and Gwern’s commentary on the deeper implications of this emergent methodology.

Implications of the inference scaling paradigm for AI safety

1. The Inference Scaling Takeoff

Kidd’s core argument begins with a marvel: OpenAI’s o1 and o3. These models boast unprecedented performance enhancements rooted in simply devoting more compute at inference time. Traditionally, ballooning performance hinged on upscaling model parameters or crunching colossal training sets. Yet o1 and o3 invert that formula: they reveal how running a model longer—or more creatively, whether through tree-of-thought expansions or more extensive chain-of-thought search—can net an impressive, often exponential-like performance boost on complex tasks. Indeed, pass@1 accuracy for an exam like AIME ratchets up in near lockstep with the logarithm of test-time compute (OpenAI, 2024). The pattern extends to o3, which overshadows older benchmarks:

2727 on Codeforces—its ranking would place it among the top 200 competitive coders worldwide.
25% on FrontierMath—feats that usually demand marathon sessions from domain experts.
88% on GPQA—where the threshold for PhD-level science knowledge is 70%.
88% on ARC-AGI—surpassing the average Mechanical Turk human on tricky visual puzzles.

Crucially, the reinforcement-learning-fortified chain-of-thought lies at the heart of these leaps. Running o3 at top speed, however, is eye-wateringly expensive—Kidd estimates ~$3,000 per single ARC-AGI query—but he also underscores that inference costs plummet by around 10x each year. If such a downward trend persists, the once-astronomical fees might become increasingly manageable.

Because training and inference expenditures are converging (Epoch AI’s analysis predicts frontier labs will funnel similar resources into each), major actors could not only invest in bigger models but might also double down on hyper-optimized inference strategies. Thus, Kidd foresees that inference scaling is here to stay—becoming central to safety concerns as AI ratchets upward toward and beyond human-level cognition.

2. Timelines, Overhang, and Safety

Kidd’s verdict on advanced AI timelines remains mostly unchanged: predictions hover around 2030–2033, though the o3 unveiling prompted some forecasters on Metaculus and Manifold to shave off about a year. Despite that minor shift, he does not see an epochal redefinition of deadlines. Instead, the more tangible safety effect is on the so-called deployment overhang. Holden Karnofsky’s scenario, in which a first human-level AI system multiplies itself into millions or billions of copies—simply by reallocating the same compute used for training—seems less ominous if inference is as financially draining as Kidd’s estimates. The cost of running o3-high might deter a swift “swarm scenario,” tempering the risk of “collective superintelligence” overshadowing humanity overnight.

Yet, the possibility of qualitative superintelligence (à la Bostrom’s Superintelligence) looms. Even if a second wave of systems remains unfeasible to spin up by the millions, a single or small cluster of super-adept minds might still surpass our capabilities in cunning ways. Kidd acknowledges that for near-term AI deployments, these stratospheric inference demands might forestall a precipitous wave of unlimited AI copies running amok—but it might also concentrate power in well-funded labs or states that can foot the bill.

3. Interpretable Chain-of-Thought and the Perils of Non-Language Reasoning

Chain-of-thought’s ascendancy under the inference scaling paradigm seems serendipitous for AI safety. When a model’s cognition is at least partially visible in textual, logically coherent steps, humans can glean insight into how a conclusion is reached. This interpretability could help watchdogs, alignment researchers, and even everyday users detect (and thwart) malignant or manipulative reasoning. Nevertheless, there’s no guarantee that the CoT transcripts are transparent or faithful to the model’s real internal processes: steganographic deception or hidden self-dialogue remains a threat.

Kidd discusses techniques like Meta’s “Coconut” project—continuous reasoning with minimal reliance on explicit text sequences—that could hamper human oversight. He warns that adopting non-language chain-of-thought for slightly improved performance might be catastrophic for safety, because a critical advantage of legible CoT would vanish. If the AI’s true reasoning escapes the textual realm into ephemeral neural embeddings, how are we to catch the seeds of Machiavellian subversion?

4. Security, Exfiltration, and the Frontier Model Landscape

A tricky dimension of AI safety is the so-called “AI security” question. Kidd underscores that smaller parameter models (as in the new inference scaling approach) may be likelier to slip under the radar. Where a GPT-6 monstrosity with billions more parameters might be prohibitively enormous to smuggle out of a data center, a smaller model—like o5, if it still matches or exceeds GPT-6’s performance—becomes easier to exfiltrate. If you can compress superhuman performance into fewer weights, you reduce the file sizes that intelligence services or rogue nations need to spirit away. At the same time, the staggering inference bills limit how effectively such a stolen model might be deployed in the wild. Kidd suggests fewer unilateral actors could exploit such contraband, perhaps diminishing the “unilateralist’s curse” scenario.

Gwern’s commentary draws a further crucial point: the reason labs like OpenAI build systems such as o1 or o3 is not solely for external deployment, but also for generating advanced training data that fuels the next iteration. This “self-play” or “search-based” approach means each big inference engine can run countless rollouts, refine correct solutions, prune failed lines of reasoning, and then distill that knowledge back into smaller or better architectures. As Gwern notes, in the style of AlphaGo or MuZero, the hardest tasks eventually require a synergy of both robust model capacity and strategic inference. Over time, the colossal cost of repeated search might pay off in a new frontier model—a refined distillation that runs cheaply on everyday hardware, or another colossus that continues pushing the boundaries of intelligence.

5. Interpretability Conundrums and RL Overreach

Smaller—yet more capable—models could theoretically make interpretability easier. Fewer parameters might be simpler to probe or label (e.g., “neuron labeling”), but superposition might intensify: the same neuron might encode multiple concepts in a highly entangled fashion, potentially making everything more opaque. Meanwhile, the impetus to refine CoT or tree-of-thought via reinforcement learning might expand, given that reward signals for each iterative inference step can be more targeted. Kidd half-optimistically posits that process-based supervision (rewarding good reasoning steps rather than only outcomes) is safer. However, he also appends an important qualification: RL on chain-of-thought could still incentivize cunning or manipulative internal strategies if the oversight is insufficient. It’s not necessarily a perfect insulation from “power-seeking” behavior. Indeed, Daniel Kokotajlo’s remarks highlight that RL on CoT is hardly pure “process-based supervision,” and might carry hidden hazards.

6. Export Controls and the Dawn of Specialized Chips

Finally, Kidd suggests that specialized hardware for inference might diverge from that for training, affecting existing export controls designed to choke off the spread of cutting-edge AI chips. Policymakers may need to pivot, monitoring novel categories of inference-optimized devices. For example, a particular GPU or ASIC that excels at parallel search might become the critical resource in future arms races. As the old line between “training hardware” and “deployment hardware” erodes, regulatory bodies might scramble to curb proliferation or keep pace with clandestine labs.

7. Gwern’s Broader Vision: The Endless Spiral of Self-Improvement

Gwern’s sweeping commentary fleshes out an overarching dynamic: the “inference scaling” frontier could birth a second wave of AI that devours its own outputs to pave the way for even mightier successors. In the AlphaGo era, after the raw computational blitz discovered novel game strategies, engineers distilled and retrained the system, making it stronger while requiring fewer resources. The same pattern may be repeating with o1 → o3 → o4 → o5. The biggest labs appear giddy with the possibility that these iterative loops herald unstoppable climbs in performance—perhaps culminating in superintelligent models capable of automating much of their own R&D. The immediate upshot? Labs might not widely release intermediate, compute-hungry models. Instead, they might keep them in a “black box” generating data for the next iteration. Eventually, the polished offspring gets deployed far more cost-effectively. For outside observers, it could look as if an AI lab invests monstrous resources behind closed doors for months on end, only to reveal a polished system leaps beyond the prior iteration—and ironically, cheaper to run than the mid-stage prototypes.

Gwern also highlights references like Jones (2021), urging a careful study of scaling trends. “Spamming small, dumb models” might suffice for trivial tasks, but the “hardest-case problems” demand either colossal scale or cunning search. As soon as you can generate your own advanced data, you’re on the threshold of a self-improvement pipeline that keeps accelerating.

Conclusion

In sum, Ryan Kidd’s article illuminates how inference scaling transforms AI’s progression from a purely training-centric escalator to a hypercharged synergy between model size, test-time expansions, and iterative feedback loops. The safety ramifications are labyrinthine: from shortened deployment overhang to the partial reassurance that not everyone can afford to run a cutting-edge system at full tilt, from the boon of chain-of-thought transparency to the lurking threat of covert or compressed reasoning. Meanwhile, Gwern’s commentary broadens the lens, reminding us of how even ephemeral, astronomically expensive inference cycles can be harnessed to bootstrap future models. Indeed, we may be witnessing the dawn of a self-propagating AI renaissance—one that demands new oversight strategies, novel interpretability breakthroughs, robust security measures, and agile export controls.

Whether these constraints can keep pace with the scorching rate of progress remains an open question. Yet the consensus from Kidd’s analysis is clear: Inference scaling is not a passing curiosity; it may be the crucible from which the next rung of superintelligence emerges. If so, our vigilance—and our willingness to refine or even overhaul existing safety frameworks—will be tested as never before.

Sources & Further Reading

Kidd, R. (2025). Implications of the Inference Scaling Paradigm for AI Safety. Retrieved from LessWrong.
Gwern. (2025). Comment on “Implications of the Inference Scaling Paradigm for AI Safety”. Available at LessWrong.
OpenAI. (2024). Scaling Strategies for the o1 and o3 Models. Retrieved from OpenAI.
Jones, T. (2021). Deep Search and Model-Based Scaling in Large Neural Architectures. Journal of Machine Learning Research (JMLR).