SpaceX Is Rewriting the Rules of AI Training — From Scratch, in C

How Elon Musk’s aerospace company just declared war on the AI industry’s most trusted tool

On the morning of May 28, 2026, Elon Musk posted something on X that sent shockwaves through the AI engineering community. It wasn’t about rockets. It wasn’t about Tesla. It was a single announcement about software: SpaceX has nearly completed V1.0 of a custom AI training stack, written in C and designed to run on 220,000 NVIDIA GB300 GPUs, which Musk claims could be more than 10x faster than Google’s JAX framework.

For most people, this reads like alphabet soup. For anyone who knows what goes into training a frontier AI model, it reads like a declaration of war on the status quo.

The Tool Everyone Uses — And Why That’s a Problem

To understand why this matters, you first need to understand JAX.

JAX is a machine learning framework developed by Google. It’s fast, flexible, supports automatic differentiation, and compiles code through the XLA compiler to run efficiently on GPUs and TPUs. Several major AI labs, including Google DeepMind, Anthropic, and, in its early days, xAI, have used JAX as part of their training infrastructure, while others such as OpenAI and Meta have built primarily on PyTorch.

xAI’s original Grok stack, in fact, was built on Kubernetes, Rust, and JAX. It was a solid, production-grade setup used to train some of the world’s most capable AI models.

So why throw it all out?

The answer lies in a concept called abstraction overhead. Frameworks like JAX are designed to be general-purpose. That generality is incredibly useful when you’re a research team trying to experiment quickly across different architectures, hardware configurations, and model types. But generality comes at a cost: the framework has to make assumptions, add safety layers, and introduce intermediary steps that, at extreme scale, add up to real, measurable inefficiency.

When you’re training a model on 10 GPUs, that overhead is trivial. When you’re training on 220,000 GPUs, every percentage point of inefficiency translates into millions of dollars and weeks of lost time.
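As a rough illustration of that arithmetic, the sketch below prices one percentage point of lost utilization. Only the 220,000 GPU count comes from the announcement; the $2.00 per GPU-hour rate and the 90-day run length are assumed figures chosen for illustration.

```python
# Back-of-the-envelope cost of wasted GPU time on a cluster.
# ASSUMPTIONS (illustrative, not from the announcement):
#   - an all-in cost of $2.00 per GPU-hour
#   - a 90-day training run

def wasted_dollars(num_gpus, days, dollars_per_gpu_hour, inefficiency):
    """Dollar value of the GPU-hours lost to a fractional inefficiency."""
    gpu_hours = num_gpus * days * 24
    return gpu_hours * dollars_per_gpu_hour * inefficiency

small = wasted_dollars(num_gpus=10, days=90,
                       dollars_per_gpu_hour=2.00, inefficiency=0.01)
large = wasted_dollars(num_gpus=220_000, days=90,
                       dollars_per_gpu_hour=2.00, inefficiency=0.01)

print(f"10 GPUs:      ${small:,.0f} lost per percentage point")
print(f"220,000 GPUs: ${large:,.0f} lost per percentage point")
```

Under these assumed prices, one percentage point costs a few hundred dollars on 10 GPUs and roughly $9.5 million on 220,000.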

Going Bare Metal on 220,000 GPUs

What SpaceX built is fundamentally different. Instead of using a high-level framework that abstracts the hardware away, their team wrote a training stack in C, one of the oldest, lowest-level programming languages still in common use. C gives programmers direct control over memory management and data layout, and close control over how code talks to the hardware. There’s no garbage collector, only a minimal runtime, and no abstraction layer between your code and the silicon beyond the ones you choose to write.

The stack “exact-maps” to 220,000 NVIDIA GB300 GPUs, which suggests it was written with precise knowledge of exactly how many GPUs exist in the cluster, how they’re connected, and how data should flow between them. This isn’t software designed to run on any hardware. It’s software designed to run on this exact hardware, as efficiently as physically possible.
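One way to picture software that is mapped to a fixed cluster is a set of hard-coded topology constants plus a pure function from a GPU’s global rank to its physical position. The sketch below is hypothetical and is not SpaceX’s code: the node and rack sizes are assumptions, the names `locate` and `same_rack` are invented for illustration, and it is written in Python for brevity rather than in C.

```python
# Hypothetical fixed-topology map: every constant is known ahead of time,
# so locating a GPU is pure integer arithmetic with no runtime discovery.
# ASSUMPTION: the node and rack sizes are illustrative, not SpaceX's layout.
from typing import NamedTuple

TOTAL_GPUS = 220_000      # from the announcement
GPUS_PER_NODE = 4         # assumed
NODES_PER_RACK = 18       # assumed (72 GPUs per rack)
GPUS_PER_RACK = GPUS_PER_NODE * NODES_PER_RACK

class GpuLocation(NamedTuple):
    rack: int
    node: int   # node index within the rack
    gpu: int    # GPU index within the node

def locate(rank: int) -> GpuLocation:
    """Map a 0-based global GPU rank to its (rack, node, gpu) position."""
    if not 0 <= rank < TOTAL_GPUS:
        raise ValueError(f"rank {rank} is outside the cluster")
    rack, within_rack = divmod(rank, GPUS_PER_RACK)
    node, gpu = divmod(within_rack, GPUS_PER_NODE)
    return GpuLocation(rack, node, gpu)

def same_rack(a: int, b: int) -> bool:
    """True if two ranks share a rack, where links are typically fastest."""
    return locate(a).rack == locate(b).rack

print(locate(0))
print(locate(TOTAL_GPUS - 1))
print(same_rack(70, 71), same_rack(71, 72))
```

Because the layout is fixed, a scheduler built this way could decide at compile time which ranks talk over fast in-rack links and which must cross the network, at the price of being tied to one cluster shape.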

The networking side is equally critical. The stack uses 800G NICs, network interface cards rated at 800 gigabits per second (about 100 gigabytes per second), to move data between nodes. In distributed training, once the individual GPUs are fast enough, the speed at which they can communicate with each other often becomes the real bottleneck. Every gradient update and every activation passed between model layers on different nodes travels over this network fabric. Faster networking means less time waiting, more time computing.
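To see why link speed matters, consider a rough lower bound on the time to synchronize gradients with a ring all-reduce, a common collective in data-parallel training. The sketch below is a bandwidth-only estimate under stated assumptions: it ignores latency and any overlap with computation, and the 10 GB gradient payload and 8-node group are hypothetical values, not details of the SpaceX stack.

```python
# Bandwidth-only estimate of ring all-reduce time.
# ASSUMPTIONS: 10 GB of gradients and an 8-node ring are illustrative values;
# latency, protocol overhead, and compute/communication overlap are ignored.

def ring_allreduce_seconds(payload_bytes, num_nodes, link_gbits_per_s):
    """Each node sends (and receives) 2 * (n - 1) / n of the payload
    over its own link during a ring all-reduce."""
    link_bytes_per_s = link_gbits_per_s * 1e9 / 8   # gigabits -> bytes
    traffic_bytes = 2 * (num_nodes - 1) / num_nodes * payload_bytes
    return traffic_bytes / link_bytes_per_s

PAYLOAD = 10e9  # bytes of gradients per synchronization (assumed)

for gbits in (400, 800):
    t = ring_allreduce_seconds(PAYLOAD, num_nodes=8, link_gbits_per_s=gbits)
    print(f"{gbits}G link: {t:.3f} s per all-reduce")
```

In this simplified model, doubling the link rate from 400G to 800G halves the synchronization time, from about 0.35 seconds to about 0.175 seconds per step.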

And threading through all of it is pipeline parallelism, a technique where the layers of a model are broken up into stages and distributed across different GPUs, so that multiple batches of data can flow through different stages simultaneously, like an assembly line. For models with hundreds of billions of parameters, whose weights and optimizer state are far too large to fit in a single GPU’s memory, splitting the model across devices this way isn’t a nice-to-have optimization. It’s part of what makes training possible in the first place.
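The assembly-line picture can be made concrete with a toy schedule. The sketch below models only the forward pass of a simple GPipe-style pipeline, where each stage takes one time step per micro-batch; production schedules also interleave backward passes, and nothing here reflects how the SpaceX stack actually schedules work. The function names are invented for illustration.

```python
# Toy forward-only pipeline schedule (a simplification: real schedules
# also interleave backward passes and have uneven stage times).

def forward_schedule(num_stages, num_microbatches):
    """Return grid[t][s]: the micro-batch that stage s processes at time
    step t, or None if the stage is idle at that step."""
    steps = num_stages + num_microbatches - 1
    return [
        [t - s if 0 <= t - s < num_microbatches else None
         for s in range(num_stages)]
        for t in range(steps)
    ]

def idle_fraction(grid):
    """Share of (time step, stage) slots in which a stage sits idle."""
    slots = [cell for row in grid for cell in row]
    return slots.count(None) / len(slots)

for microbatches in (1, 4, 32):
    grid = forward_schedule(num_stages=4, num_microbatches=microbatches)
    print(f"{microbatches:>2} micro-batches: "
          f"{idle_fraction(grid):.0%} of stage slots idle")
```

With four stages, the idle share falls from 75% with a single micro-batch to under 10% with 32, which is why keeping many micro-batches in flight matters so much.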

What 10x Actually Means

Musk’s claim of “over an order of magnitude” speed improvement versus JAX for large training runs is extraordinary. But what does 10x actually mean in practice?

Consider this: Llama 3.1 405B required approximately 3.8 × 10²⁵ floating-point operations to train. Meta used roughly 16,000 H100 GPUs for that run, which is commonly cited as taking 54 days. Scale that to a larger model, and you’re talking about training runs that can cost hundreds of millions of dollars and span months.
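Taking the figures above at face value, the relationship is simple: training time equals total operations divided by the number of GPUs times the sustained rate of each GPU. The sketch below derives the sustained per-GPU rate those numbers imply and then asks how long the same run would take if that rate were ten times higher. It is plain arithmetic on the quoted figures, not a measurement.

```python
# Training-time arithmetic using the Llama 3.1 405B figures quoted above.
SECONDS_PER_DAY = 86_400

def sustained_flops_per_gpu(total_flops, num_gpus, days):
    """Average per-GPU rate implied by a run's size, GPU count, and length."""
    return total_flops / (num_gpus * days * SECONDS_PER_DAY)

def days_to_train(total_flops, num_gpus, flops_per_gpu):
    """Days needed for a run at a given sustained per-GPU rate."""
    return total_flops / (num_gpus * flops_per_gpu) / SECONDS_PER_DAY

TOTAL_FLOPS = 3.8e25
NUM_GPUS = 16_000

rate = sustained_flops_per_gpu(TOTAL_FLOPS, NUM_GPUS, days=54)
print(f"implied sustained rate: {rate / 1e12:.0f} TFLOP/s per GPU")

# Hypothetical: the same run at 10x the sustained rate.
faster = days_to_train(TOTAL_FLOPS, NUM_GPUS, rate * 10)
print(f"same run at 10x the rate: {faster:.1f} days")
```

On these numbers, a 10x gain in sustained throughput would turn a 54-day run into one of about 5.4 days.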

A 10x improvement doesn’t mean one thing gets faster. It means the entire economics of frontier AI training shifts. Experiments that took three months now take about nine days. Models that cost $500 million to train now cost $50 million. Iteration cycles that were quarterly become closer to weekly. The compound effect on research velocity is hard to overstate.

To be fair, independent verification doesn’t yet exist. Experts have noted that achieving consistent 10x improvements at this scale is genuinely challenging — Amdahl’s Law, which describes the limits of parallel speedup due to sequential bottlenecks, places hard constraints on what’s achievable. Communication overhead between 220,000 GPUs doesn’t vanish just because your stack is well-written. But even achieving a fraction of the claimed gains — 3x, 4x, 5x — would be transformative.
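Amdahl’s Law is easy to state in code. If a fraction p of the work is sped up by a factor s and the rest is untouched, the overall speedup is 1 / ((1 - p) + p / s). The sketch below applies that general form to a training step; the fractions are illustrative, not measurements of any real stack.

```python
# Amdahl's Law: overall speedup when only part of the work gets faster.
# The fractions below are illustrative, not measured values.

def amdahl_speedup(improved_fraction, factor):
    """Overall speedup when `improved_fraction` of the original run time
    is accelerated by `factor` and the remainder is unchanged."""
    remaining = 1.0 - improved_fraction
    return 1.0 / (remaining + improved_fraction / factor)

for fraction in (0.50, 0.80, 0.90, 0.99):
    overall = amdahl_speedup(fraction, factor=10)
    print(f"{fraction:.0%} of step time made 10x faster "
          f"-> {overall:.2f}x overall")
```

Speeding up 90% of each step by 10x yields only about 5.3x overall, and if 10% of the time is left untouched, the total can never reach 10x no matter how fast the rest becomes.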

Musk also confirmed that the new training stack will power Grok v5, xAI’s next major model release. That’s not a roadmap item or a future possibility. That’s a commitment.

The Colossus Cluster: Where This All Lives

The hardware context for this announcement matters enormously. SpaceX isn’t operating a few thousand GPUs in a modest data center. Following the acquisition of xAI in February 2026, SpaceX now controls what may be the world’s most powerful AI training complex: Colossus 2 in Memphis, Tennessee.

Colossus 2 became operational at gigawatt-scale power in January 2026, making it the world’s first coherent AI training cluster to reach that threshold. The facility houses approximately 550,000–555,000 NVIDIA Blackwell-series GPUs — primarily GB200 and GB300 chips — and operates at roughly 1 GW of power. For reference, that’s equivalent to the peak electricity demand of a city the size of San Francisco. Plans are in place to expand to 1.5 GW.

As of April 8, 2026, Musk confirmed that Colossus 2 is simultaneously running seven distinct model training jobs, including image generation models and language models scaling up to an almost incomprehensible 10 trillion parameters. For context, GPT-4 is widely estimated to have approximately 1.8 trillion parameters. The 10T variant SpaceX is training would represent a generational leap in scale.

The dual variants at 1T and 1.5T parameters being trained in parallel are also notable: running architecture experiments side by side at matched parameter counts is a classic technique for rapidly testing different approaches without committing the full compute budget to a single bet.

And then, just weeks ago, Anthropic announced it had signed a deal to use the full compute capacity of Colossus 1, adding more than 300 megawatts of compute — equivalent to over 220,000 additional NVIDIA GPUs. NVIDIA celebrated on X: “Two frontier labs. One accelerated computing platform.”

The GB300 itself is no ordinary chip. Built on NVIDIA’s Grace Blackwell Ultra architecture, it ships as a liquid-cooled, rack-scale system rated at a peak of 1,440 PFLOPS of low-precision FP4 compute per rack, designed specifically for large-scale AI training and inference. Musk has called it “the best AI computer.” Having 220,000 of these GPUs, all mapped precisely in software, is something no other organization is known to have done before.

SpaceX Engineering Culture Meets AI Infrastructure

To understand why SpaceX was the organization to attempt this, it’s worth thinking about the culture that built it.

SpaceX is famous for vertical integration — the philosophy that if you want to do something right, you build it yourself. They manufacture their own rocket engines, their own avionics, their own fuel systems. When commercial suppliers couldn’t meet their standards or timelines, SpaceX simply built internal teams to own the problem end-to-end. The result was Falcon 9, Starship, and a commercial spaceflight revolution.

That same philosophy, applied to AI infrastructure, produces exactly what Musk announced: a custom training stack in a low-level language, written specifically for the exact hardware you own, optimized for the exact workloads you’re running. No dependencies on external frameworks. No abstraction layers you didn’t write. No performance you didn’t choose to leave on the table.

Writing production software in C in 2026 is not a nostalgic choice. It’s a deliberate statement about priorities. C gives you control. It gives you predictability. And at the scale SpaceX is operating — where a single percentage point of GPU utilization translates to millions of dollars — control and predictability are worth the engineering cost.

The multi-GPU and multi-node training infrastructure required to make this work is itself a formidable engineering challenge. NVLink provides high-speed, direct GPU-to-GPU links with up to 1.8 TB/s of bidirectional bandwidth per GPU. NVSwitch aggregates those connections so every GPU within an NVLink domain, such as a single rack, can communicate with every other GPU in that domain; traffic between domains travels over the NIC-based network fabric. At 220,000 GPUs, the coordination problem is staggering. Every layer placement decision, every gradient synchronization, every pipeline stage assignment has to be exactly right to avoid the communication bottlenecks that typically kill distributed training efficiency.

SpaceX designed around all of it, from scratch, in a language that doesn’t hold your hand.

What This Means for the Rest of the AI Industry

The competitive implications here are significant — and somewhat uncomfortable for the incumbents.

OpenAI, Google DeepMind, Meta, and Anthropic have all invested heavily in their own infrastructure over the past few years. But all of them still rely, to varying degrees, on shared frameworks and general-purpose tooling. The shift toward custom infrastructure is a trend that has been building — much as national laboratories in the HPC era eventually built specialized software for their supercomputers — but SpaceX appears to have executed on it more aggressively than anyone else.

If the performance claims hold under benchmarking, the practical result is that SpaceX/xAI can train a given model faster and cheaper than any lab using JAX-equivalent infrastructure. That compounds over time. More training runs. More experiments. More rapid iteration. More capable models, sooner.

The counterargument, worth taking seriously, is that framework agility also matters. JAX and its equivalents are powerful partly because they let researchers experiment quickly without deep infrastructure knowledge. A hyper-optimized bare-metal stack is harder to modify, harder to debug, and harder to adapt to novel architectures. SpaceX is betting that once you’ve found the architecture you want to scale, you want to scale it fast — and that the engineering investment in a custom stack pays for itself at this level.

Given that they’re already training a 10T parameter model on their existing cluster, they may have already made that bet and won it.

The Road Ahead

Version 1.0 is nearly complete. That word — version — matters. This is the beginning of an infrastructure roadmap, not the end of one. V2.0 will presumably optimize further, expand to additional hardware, and incorporate lessons from the first large-scale training runs.

The open question that matters most is validation. Benchmarks published by SpaceX or xAI will be scrutinized heavily by the research community. Independent replication will take time. And Musk’s history of ambitious timelines — in every domain — warrants appropriate skepticism about the “over an order of magnitude” figure until production results are published.

But here’s what’s not in dispute: SpaceX has built a 1-gigawatt AI training facility, filled it with over half a million of the world’s best GPUs, merged it with one of the most capable AI research organizations on the planet, and now written custom software to squeeze every possible cycle of performance from that hardware. Whatever the final benchmark number turns out to be, the ambition and execution are real.

The SpaceX-ification of AI is underway. And if history is any guide, the rocket scientists tend to land their ships.