In the rapidly unfolding tapestry of Artificial Intelligence, certain moments arrive that jolt the entire field. They are moments of astonishment; times when the conventional wisdom of “bigger is better” or “scale at any cost” crashes headlong into an unforeseen challenger that does more with less. Against the relentless backdrop of nine-figure training budgets and ballooning GPU clusters, one Chinese AI initiative has thrust itself into the spotlight with a frontier-grade LLM. It goes by the name DeepSeek-V3, and the figures are, depending on your vantage, either jaw-dropping or quietly revolutionary. Achieving Claude 3.5 or GPT-4o-level performance on an alleged $5.5 million training run? Sign us up for the new revolution.
This blog post dives deep into the architectural underpinnings of DeepSeek-V3, highlighting how a 671B-parameter Mixture-of-Experts (MoE) behemoth sidestepped the customary cost impediments. Let’s explore these breakthroughs clearly and thoroughly, drawing on both compelling stories and real data to show the true scope of what DeepSeek is achieving. By the end, you’ll not only know about this model but also understand what it signifies for the broader AI domain—and why it might matter more than we think, especially in an era of intensifying competition and tightening GPU supply constraints.
Let’s begin by painting a picture of the friction that preceded this remarkable moment. For years, big players in the AI ecosystem have thrown mountainous sums—$100 million, $200 million, and beyond—at the problem of training ever larger language models, guided by the premise that scaling parameter counts and training data to astronomical levels yields superior performance. This approach, typified by the likes of GPT-4, Llama 3, and Claude 3.5, has validated the potency of large-scale language models. Yet it has also led many to fear that only the best-funded AI labs could place meaningful bets on frontier-level capabilities. Indeed, myths arose that anything less than 16K–100K premium GPUs running for months would prove insufficient to challenge, or even approach, frontier performance.
Enter DeepSeek-V3. Various references and discussions about it—some as casual as an X (Twitter) commentary, others as official as the repository for DeepSeek-V3 or the technical PDF—corroborate that it was forged largely on a comparatively “tiny” cluster of 2,048 H800 GPUs for about 55 days. The total cost? As low as $5.576 million, with some rounding to $5.5 million for easier referencing. That might be the budget a more resource-intensive lab spends on cloud storage and coffee during a 14-week training spree on thousands of H100s. Yet, the results do not merely meander in the realm of mediocrity. They storm the gates of Claude 3.5 territory. By some benchmarks, DeepSeek-V3 unseats or closely matches GPT-4-level performance. So how is this even feasible?
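For readers who like to check the arithmetic, the headline numbers roughly hang together. Here is a minimal back-of-the-envelope calculation, assuming the roughly $2-per-GPU-hour rental rate commonly cited alongside the $5.576M figure; the day count is an approximation, not exact accounting:

```python
# Back-of-the-envelope check of the reported DeepSeek-V3 training cost.
gpus = 2048                   # reported H800 cluster size
days = 55                     # approximate wall-clock training time
usd_per_gpu_hour = 2.0        # assumed rental rate behind the $5.576M figure

gpu_hours = gpus * days * 24
cost_usd = gpu_hours * usd_per_gpu_hour

print(f"~{gpu_hours / 1e6:.2f}M GPU-hours, ~${cost_usd / 1e6:.2f}M")
# Prints ~2.70M GPU-hours and ~$5.41M -- in the same ballpark as the
# reported 2.8M GPU-hours and $5.576M.
```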
We can glean crucial hints by heading directly to the model’s own domain: the official DeepSeek-V3 PDF and the accompanying repository. First, the architecture. DeepSeek-V3, though touted in the headlines at “671B parameters,” is actually a Mixture-of-Experts model that only activates 37B parameters per token. That design choice is all-important in reducing computational overhead. Why feed a token through all 671B parameters if you can selectively route it to “experts” that each deal with a narrower slice of the distribution? That’s precisely what a well-tuned MoE approach accomplishes: it maintains the overarching capacity—crucial for capturing niche knowledge or subtle patterns—while not forcing every pass to churn through the entire model.
Here’s the short version of the architecture: each MoE layer holds an ensemble of 257 experts—256 so-called “routed experts” plus a single “shared expert.” Whenever a token arrives, the layer’s gating mechanism selects the top eight relevant experts from the 256 specialized ones, plus that single shared expert. Thus, 9 experts out of 257 are “active” for that token, a remarkable degree of sparsity (roughly a 1:28.6 ratio of active to total experts). Now, one might worry that such a system becomes unbalanced or complicated to train. Indeed, mixture-of-experts approaches aren’t new, but they’re notoriously tricky to get right. The hallmark difference that DeepSeek claims is the introduction of “auxiliary-loss-free load balancing.” Typically, an MoE approach must incorporate an auxiliary term in the loss function to ensure experts are used fairly, i.e., not overloading a few while ignoring the others. DeepSeek’s solution is to achieve balanced usage without that overhead. If their published performance metrics are valid indicators, they’ve succeeded.
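To make that routing story concrete, here is a toy NumPy sketch of top-8 selection over 256 routed experts plus one always-on shared expert, with a per-expert bias nudged up or down to keep load balanced instead of adding an auxiliary loss term. The hidden size, sigmoid affinity function, and bias step are illustrative assumptions in the spirit of the paper’s description, not its exact recipe:

```python
import numpy as np

rng = np.random.default_rng(0)

N_ROUTED, TOP_K = 256, 8   # routed experts, and how many each token is sent to
D_MODEL = 64               # toy hidden size (illustrative only)
BIAS_STEP = 0.001          # toy update rate for the balancing bias

W_gate = rng.standard_normal((D_MODEL, N_ROUTED)) * 0.02
expert_bias = np.zeros(N_ROUTED)   # adjusted online instead of adding an aux loss

def route(tokens):
    """Pick TOP_K routed experts per token; the shared expert is always active."""
    affinity = 1.0 / (1.0 + np.exp(-(tokens @ W_gate)))          # sigmoid affinities
    # The bias influences *which* experts get picked, not the mixing weights.
    topk = np.argsort(affinity + expert_bias, axis=-1)[:, -TOP_K:]
    picked = np.take_along_axis(affinity, topk, axis=-1)
    gates = picked / picked.sum(axis=-1, keepdims=True)          # normalize over the 8
    return topk, gates

def rebalance(topk):
    """Nudge under-used experts up and over-used experts down (no extra loss term)."""
    load = np.bincount(topk.ravel(), minlength=N_ROUTED)
    expert_bias[:] -= BIAS_STEP * np.sign(load - topk.size / N_ROUTED)

tokens = rng.standard_normal((32, D_MODEL))    # a toy batch of token activations
experts, gates = route(tokens)
rebalance(experts)
print(experts.shape, gates.shape)              # (32, 8): 8 routed experts per token
```

The shared expert is left out of the routing math above because it is always on; in the real model its output is simply combined with the weighted routed experts.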
Let’s also highlight the data side of the equation: about 14.8T tokens—14.8 trillion tokens—were used in training. By comparison, many large LLMs historically used anywhere from a few hundred billion tokens (in older GPT versions) to a couple of trillion tokens (Llama 2 topped out around 2 trillion). The exact composition and curation of those 14.8T tokens remain somewhat hush-hush in parts, but we know from the references they are “high-quality.” We also see an emphasis on training stability at modest cost, presumably aided by optimizations like DualPipe (overlapping communication and computation in the pipeline) and the adoption of FP8 mixed precision. FP8, a bold step that was once considered suspect for large-scale model training, apparently delivered the goods in practice, cutting costs further still.
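FP8 arithmetic itself needs hardware support, but the reason fine-grained scaling makes such low precision viable can be shown with a toy experiment. The sketch below is not DeepSeek’s kernel: the 128-value block size is borrowed from the general fine-grained-quantization idea, and the FP8 “cast” is simulated with a coarse uniform grid purely to expose how outliers wreck a single global scale:

```python
import numpy as np

FP8_MAX = 448.0   # largest finite value in the FP8 E4M3 format
BLOCK = 128       # per-block scaling granularity (illustrative choice)

def fake_low_precision(x, levels=256):
    """Stand-in for an FP8 cast: snap to a coarse uniform grid over [-FP8_MAX, FP8_MAX].
    Real FP8 uses a non-uniform floating-point grid, but the limited precision has a
    comparable effect for this demonstration."""
    step = 2 * FP8_MAX / levels
    return np.clip(np.round(x / step) * step, -FP8_MAX, FP8_MAX)

def quant_dequant(x, blockwise=True):
    """Scale into the FP8 range (per 128-value block, or with one global scale),
    'cast' to low precision, then scale back up."""
    blocks = x.reshape(-1, BLOCK) if blockwise else x.reshape(1, -1)
    scale = np.abs(blocks).max(axis=1, keepdims=True) / FP8_MAX
    return (fake_low_precision(blocks / scale) * scale).reshape(x.shape)

x = np.random.default_rng(0).standard_normal(1 << 16).astype(np.float32)
x[::4096] *= 1000.0   # sprinkle in the kind of outliers activations are known to have
for blockwise in (False, True):
    err = np.abs(x - quant_dequant(x, blockwise)).mean()
    print(f"blockwise={blockwise}: mean abs error {err:.4f}")
# Per-block scales keep the error tiny; a single global scale lets a few outliers
# wash out almost all of the precision.
```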
Now, whenever claims are made of a “frontier” LLM—especially an open-source one—people want to see benchmark results. Let’s sample some of the known data points: 90.2% on MATH-500, 65.2% on HumanEval, and a 51.6 percentile rating on Codeforces. That spans mathematical reasoning, code generation, and competitive programming, and the model is no slouch in any of them. The model is also purported to do well in creative tasks and advanced language understanding, showing that scaling the capacity so drastically (whether dense or MoE) reaps general-purpose benefits, not just specialized ones. Indeed, the references (once again, see the official DeepSeek-V3 PDF) note that it “matches or exceeds closed-source models like GPT-4o and Claude 3.5 (Sonnet) on many evaluations.”
A quick note on naming: “GPT-4o” is OpenAI’s own multimodal successor in the GPT-4 family, not a re-labeled system from another provider, so comparisons against it are effectively comparisons against GPT-4-level capability. Regardless, the consistent theme is that DeepSeek-V3 is “Claude-competitive,” or in other words, it’s playing ball in the major leagues of large language models.
To drive home the cost comparisons, the references reveal that Llama 3 at 405B parameters used around 30.8M GPU-hours of training time, presumably on high-end GPUs, whereas DeepSeek-V3 consumed 2.8M GPU-hours—roughly one-eleventh as much, yet with allegedly stronger or at least highly competitive performance. Indeed, if you had the resources to replicate the DeepSeek approach, you could hypothetically train 10, 12, or maybe 15 separate variants of DeepSeek for the cost of a single Llama 3 run. That is an audacious statement, yet the repository’s README, the official paper, and multiple external commentaries are replete with confidence on this matter. We look forward to independent testing on this.
Consider the generational shift in GPUs as well. The references speak of H800 GPUs, the export-compliant Chinese-market variant of Nvidia’s flagship H100; the main handicap is reduced interconnect bandwidth rather than weaker raw compute, so the chips remain powerful. The cost advantage may in part be due to local pricing or supply-chain arrangements that standard Western estimates might not reflect. But we must also consider the technique’s potential for global adoption: if you can replicate the same MoE-based approach with strong load balancing and pipeline optimizations on, say, an AWS or on-premises cluster of H100s or even A100s, you might approximate these cost savings too. Skeptics might worry about scaling complexities or about the difficulty of maintaining stable training at such a large parameter count. The fact remains that DeepSeek-V3’s reported success, and the (relatively) modest $5.5 million cost, stands in the open for verification.
Which leads to a crucial question: What does “frontier-grade LLM” mean in practice? We’ve seen the model generate coherent text across multiple languages, handle code with some competence, and operate with a 128k token context window. Indeed, that last tidbit is also critical: a 128k context window means this model can consider an enormous textual input—on the order of an entire novel at once—enabling more advanced forms of summarization, multi-document analysis, or multi-turn conversation.
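For a sense of scale, here is a quick conversion; the ~0.75 words-per-token ratio is a rough rule of thumb for English text and varies by tokenizer:

```python
# Ballpark conversion of a 128k-token context window into words.
context_tokens = 128_000
words_per_token = 0.75          # rough English-text heuristic; tokenizer-dependent
print(f"~{context_tokens * words_per_token:,.0f} words")
# ~96,000 words, i.e. roughly the length of a typical 80k-100k-word novel.
```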
DeepSeek-V3 achieves that long context window with a “two-stage extension” approach, described in their technical documents as a process that first trains the base model at a shorter context window, then performs staged “context extension” training to reach the full 128k without losing coherence. At a reported cost of about $0.238M, that extension work is trivial next to the overall $5M+ budget, but it’s no less crucial to the final user experience.
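The mechanics of such an extension usually revolve around the model’s rotary position embeddings. The sketch below shows plain position interpolation, a simpler cousin of the YaRN-style rotary scaling typically used for this kind of extension; the head dimension, 4K base length, and 32x scale factor are illustrative assumptions rather than DeepSeek’s exact settings:

```python
import numpy as np

def rope_angles(seq_len, head_dim, base=10000.0, scale=1.0):
    """Rotary-embedding angles; scale > 1 compresses positions so a long sequence
    maps onto the position range the base model already saw during pre-training."""
    inv_freq = 1.0 / (base ** (np.arange(0, head_dim, 2) / head_dim))
    positions = np.arange(seq_len)[:, None]
    return positions * (inv_freq / scale)[None, :]

short = rope_angles(4_096, head_dim=128)                   # base context window
extended = rope_angles(131_072, head_dim=128, scale=32.0)  # 128k with interpolation

# The last position of the 128k window lands on angles the model already knows
# from position 4095, which is why extension training can be comparatively cheap.
print(short[-1, 0], extended[-1, 0])   # 4095.0 vs ~4096.0 (instead of 131071.0)
```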
Meanwhile, the training process includes a “post-training” step that cost around $0.01M (roughly $10,000). That might be for final alignment, instruction tuning, or some finishing touches. If you’re used to reading about big labs spending tens of millions on alignment alone, or a million or more on just a few RLHF runs, DeepSeek’s approach of a combined $248,000 for context extension and post-training is startling, to say the least. But the results appear to speak for themselves, if user trials and partial benchmarks are to be believed.
For the intrepid tinkerers among us, the question, “Can I run DeepSeek-V3 locally?” is as relevant as it is daunting. The short answer: yes, you can—but brace yourself. The instructions for local usage mention that the model is 671B parameters, plus 14B for the Multi-Token Prediction (MTP) module, summing to around 685B. Even if only 37B parameters are “activated” at a time, you still need to store all of them in RAM or VRAM so that the gating mechanism can route tokens properly. In a best-case scenario, quantizing the weights down to 4 bits per parameter yields around 335.5GB for the 671B main model alone, and closer to 343GB once the MTP module is included. That’s borderline feasible for only the bravest data centers or well-heeled HPC setups. Mac users, consider your 64GB or even 128GB machines out of the question. For now, HPC servers or cloud instances with 512GB of CPU RAM might be the only route for a truly local run. However, the authors do say CPU-only inference is possible, if you can stomach the overhead. It might just be a proof-of-concept or a demonstration of bragging rights, but it’s undeniably intriguing for those yearning for a local giant to test specialized tasks.
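The arithmetic behind those memory estimates is simple enough to check yourself; note this counts weights only, before activations or the KV cache:

```python
# Weight-only memory footprint for the full 685B-parameter checkpoint
# (671B MoE model + 14B MTP module) at different precisions.
params = (671 + 14) * 1e9
for bits in (16, 8, 4):
    print(f"{bits:>2}-bit weights: ~{params * bits / 8 / 1e9:,.0f} GB")
# 16-bit: ~1,370 GB   8-bit: ~685 GB   4-bit: ~343 GB
```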
For the rest of us, you can check out the model in a friendlier environment. DeepSeek provides a Hugging Face Space for an interactive demo, letting you engage in text generation through a simplified UI. This is the “easier to try” route, though throughput will likely be lower than in a fully optimized local or hosted deployment. According to the references, the model can generate around 60 tokens/second—three times faster than its predecessor, DeepSeek-V2—if you have the right hardware.
A big piece of the DeepSeek innovation puzzle lies in “Multi-Token Prediction (MTP) modules.” The authors mention that DeepSeek-V3 can generate multiple tokens per step, but with a “complete causal chain” that preserves the model’s ability to reason about previously generated text. This is reminiscent of certain advanced decoding strategies, but seeing it integrated at the architecture level is rare. Typically, language models emit tokens strictly one at a time in auto-regressive fashion, which preserves fidelity but caps throughput. MTP tries to fuse the best of both worlds—speed and preserved context—by proposing several tokens per step. The claim is that the model can produce text more quickly without discarding consistency or quality. Indeed, that might be how the “60 tokens/second” figure is realized in practice.
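What that looks like in terms of data flow can be sketched very loosely. Below, a stand-in “MTP module” takes the trunk’s hidden state together with the embedding of the token just proposed and guesses one token further ahead, which is the sense in which the causal chain is kept intact. Every dimension, the tanh layer, and the random weights are placeholders; the real modules are full transformer blocks trained with their own loss:

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB, D = 1000, 64                                # toy sizes, purely illustrative

emb   = rng.standard_normal((VOCAB, D)) * 0.02     # shared embedding table
head  = rng.standard_normal((D, VOCAB)) * 0.02     # shared output head
W_mtp = rng.standard_normal((2 * D, D)) * 0.02     # stand-in for the MTP block

def propose_two(h_t):
    """From one trunk pass at step t, propose tokens t+1 and t+2 in order."""
    tok_next = int(np.argmax(h_t @ head))          # the usual next-token prediction
    mtp_in = np.concatenate([h_t, emb[tok_next]])  # condition on everything up to t+1
    h_mtp = np.tanh(mtp_in @ W_mtp)                # the extra prediction module
    tok_after = int(np.argmax(h_mtp @ head))
    return tok_next, tok_after

h_t = rng.standard_normal(D)                       # pretend trunk output at step t
print(propose_two(h_t))                            # two tokens from a single trunk pass
```

According to the references, the extra module can be dropped at inference or used to accelerate decoding; either way, the toy above is only meant to show the shape of the idea, not the actual implementation.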
From a purely rhetorical perspective, all of this raises the question: is the DeepSeek phenomenon a fluke, or does it herald a new era? Some analysts say that while the jump is impressive, it doesn’t necessarily circumvent the eventual need for massive clusters. The references themselves point out that “No, you still need large GPU clusters for frontier LLMs, you just have to ensure that you’re not wasteful with what you have.” In other words, scale remains paramount, but the margin for efficiency is far greater than previously believed. If anything, the DeepSeek story might be a wake-up call to AI labs globally: don’t be complacent, and never assume your compute usage is fully optimized. There might be entire realms of pipeline, data, or architectural improvement that slash your cost 5-10X if you’re systematic.
So how does DeepSeek-V3’s cost efficiency rewire the geopolitical conversation? Some voices are stark: “America says restricting chip access will slow China down, but China just trained a frontier model for a fraction of the cost.” It’s less about some raw performance rating and more a statement of resourcefulness. At a time when US-based labs are rumored to spend hundreds of millions each year scaling up GPU fleets, Chinese labs apparently find ways to approximate or match performance at less than 1/10th the budget. A clarion call, or just the first barrage? Should it prompt a realignment of how the West invests in AI research? Questions of policy aside, it’s difficult not to see DeepSeek-V3 as proof positive that advanced AI progress continues unabated globally, with or without top-of-the-line infrastructure or unconstrained budgets.
Let’s pivot to a deeper analysis of how the MoE approach fundamentally underpins cost savings. The mixture-of-experts idea has been around for decades, and more recently Google’s Switch Transformer and GLaM showed how to scale beyond 1T parameters at manageable training cost. The principle is reminiscent of dividing the model into specialized “departments,” each focusing on particular aspects of language, coding, or domain-specific patterns. The gating system, which decides how to route each token, has historically been the biggest pain point—leading to load imbalance, training instabilities, or the dreaded collapse where a few experts overshadow the rest. If DeepSeek discovered a truly stable gating approach, then it’s plausible they can achieve near-linear scaling for larger expansions of the same concept. Suppose they want a 1.2T, 2T, or even 10T parameter model? The architecture might continue to scale, provided they can keep the gating stable. The key question is whether the overhead from routing, communication, and pipeline synchronization eventually saturates the GPU cluster. DeepSeek’s “DualPipe” algorithm and “aux-loss-free balancing” presumably address that.
Returning to the reality of the user experience: “So what can DeepSeek-V3 do for me?” The references mention near state-of-the-art performance in code-based tasks, summarization, advanced question-answering, comprehension, and multilingual dialogues. The authors highlight robust performance in both English and Chinese. And because the entire model is open-sourced at DeepSeek’s GitHub Repo, enthusiasts can clone, fork, or adapt at will. This open nature is reminiscent of the Llama/BLOOM wave of open LLM releases, injecting new vitality into smaller labs that can piggyback on these weights to fine-tune domain-specific or vertical solutions. The difference is that DeepSeek is a Chinese-led initiative with, presumably, non-trivial Chinese-language strengths built in. It’s quite possible we’ll see a surge of new derivative models in advanced machine translation, cross-lingual analytics, or specialized creative writing for Sino-centric content.
But it’s not all sunshine. One must not forget the memory demands, the potential complexities of MoE inference, or the ongoing question of how thoroughly the model was aligned or tested for fairness, bias, or safety concerns. The references mention that “LLM arena rankings are ongoing, and my few quick tests went well so far,” but we do not see a full-blown user-level evaluation or a robust discussion of, for instance, potential disallowed content. As an open model, the lines of responsibility blur. The full community must remain vigilant, ensuring the model’s usage does not drift into misuse or produce harmful outputs. That’s the nature of open-source AI: it fosters innovation, but also demands heightened accountability from the user base.
From a technical vantage, the model’s performance on code tasks—65.2% on HumanEval and a 51.6 percentile rating on Codeforces—illustrates that while it’s strong, it may not dethrone the best of GPT-4 in certain coding contexts. GPT-4 can push above 80% on some coding benchmarks. Nevertheless, for a fraction of the cost and a fraction of the compute, the results are stellar. Meanwhile, the 90.2% on MATH-500 is quite near the top rung of large models specialized in mathematical reasoning. Real-world usage, of course, includes chain-of-thought prompting, large-context summarization, and multi-turn dialogues, which the references say are all part of DeepSeek-V3’s repertoire.
One enthralling detail is the mention of “371B activated parameters, 256 routed experts, 1 shared expert.” Actually, the official statement is “37B activated parameters”; the “371B” appears to be a typographical slip in some references. The correct figure from the PDF and official sources is 37B, a small fraction of what a fully dense 671B pass would require. That is the entire point of MoE: the total capacity is 671B, but each token only sees 37B. That’s how the model can be so much more computationally efficient. A 37B-parameter forward pass is large but not insane on a cluster of GPUs, especially if you precisely orchestrate the data pipeline. Yet the model retains the breadth of knowledge gleaned from training on trillions of tokens across all experts. The gating mechanism’s success means you can effectively harness that giant capacity on an as-needed basis.
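The efficiency argument can be reduced to a couple of lines of arithmetic using the common rule of thumb that a forward pass costs roughly 2 FLOPs per active parameter per token; this is a heuristic, not a measurement of DeepSeek’s actual kernels:

```python
# Per-token forward-pass compute: MoE (37B active) vs. a hypothetical dense 671B model.
active, dense = 37e9, 671e9
flops = lambda p: 2 * p                      # ~2 FLOPs per parameter per token
print(f"MoE:   ~{flops(active) / 1e9:.0f} GFLOPs/token")
print(f"Dense: ~{flops(dense) / 1e12:.2f} TFLOPs/token ({dense / active:.0f}x more)")
```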
That notion of dynamic routing invites an interesting synergy with the concept of “expert domains.” One could imagine that some experts excel at code, others at Chinese history, and still others at mathematics. The gating then dispatches tokens to the relevant domain experts. If the gating is correct, you get synergy and specialization; if it’s off, you get subpar results or spurious outputs. On balance, it appears from the reported benchmarks that gating has been effectively guided by the training process. Interestingly, the references mention the gating mechanism is “aux-loss-free,” meaning they do not rely on an additional gating-related term in the loss; instead, balance is maintained by dynamically nudging per-expert routing biases during training. That’s quite the engineering feat. If validated, it might open new frontiers in how we design large-scale MoE to be simpler and less prone to finicky hyperparameter adjustments.
Given this success, we arrive at the question: “Will future frontier LLMs take a page out of DeepSeek’s book?” Possibly yes. For instance, one can foresee a scenario where Meta or OpenAI incorporate advanced mixture-of-experts layers into GPT-5 or Llama 4 to keep training costs in check. Meanwhile, hardware constraints remain very real, so any method that yields a ~10X advantage in speed or memory usage is a big deal. But new hardware also arrives on the scene. The references name-drop the next-generation Nvidia GPUs (GB300 & B300) expected in 2025, which might deliver 50% higher FLOPS. Coupled with MoE efficiency, training runs that cost $5 million might plummet to $2.5 million or less. That’s the unstoppable momentum of Moore’s Law (or Huang’s Law) plus algorithmic ingenuity at work.
At this juncture, we might wonder: “Does smaller always mean better?” Probably not. DeepSeek has hammered home that you still need scale, but you must pursue it cleverly. The model is still a behemoth of 671B total parameters, just with an efficient approach to routing. We’re not talking about a $50,000 training run done on someone’s single GPU rig. Rather, we’re seeing a top-tier lab compressing perhaps $100 million worth of frontier effort into $5 million, and that difference is momentous. For smaller players, the question is whether an MoE-based approach scaled down to, say, 50B or 100B parameters yields similar cost or performance leaps. The best route might be to start modest, replicate the gating mechanism, confirm stable training, then scale up. Possibly, we’ll see an entire open movement adopting this approach—achieving expansions in parameter count without incurring the usual ballooning cost. This stands in contrast to dense models, which must push every token through every layer’s full parameter set.
In summary, the accomplishments of DeepSeek-V3 are many:
- A cost of ~$5.5 million for training a model that stands toe-to-toe with Claude 3.5 and possibly GPT-4-level models in certain tasks—less than 1/10th the typical sums required.
- A massive 671B parameter count, yet only 37B need be activated per token thanks to an advanced MoE scheme with 256 routed experts plus 1 shared expert.
- A total of 14.8T training tokens consumed through a pipeline that combines FP8 mixed precision, DualPipe communication overlap, and auxiliary-loss-free load balancing.
- Intriguing multi-token prediction (MTP) modules that allow partially parallel generation while preserving the causal chain.
- Multi-lingual capability with a 128k token context length, opening the door to extensive summarization or multi-turn dialogues.
- Full open-source release, enabling both academic scrutiny and wide-scale adoption by businesses or researchers eager to tweak the architecture or adapt it for specialized tasks.
Want to read more? Check out the official DeepSeek-V3 repository on GitHub and the DeepSeek V3 PDF. If you’re curious to test it, consider the Hugging Face Space demo to gauge performance. For local installs, the instructions in the README’s “How to Run Locally” section might be your best friend—just make sure you have the hardware for it (read: hundreds of gigabytes of RAM or VRAM).
Still, it’s never wise to accept a single set of data at face value. It will be instructive to see the model tested by external labs or integrated into standard LLM leaderboards. Benchmarks like MMLU, Big-Bench Hard, HELM, or real-world tasks requiring reasoning under uncertainty will reveal if DeepSeek-V3’s efficiency trick is as robust as these initial figures suggest. If it holds, we may well witness a wave of new entrants into the arms race for next-generation LLMs. Suddenly, the barrier to state-of-the-art might be measured in the single-digit millions, not the tens or hundreds of millions. That has ramifications for smaller corporations, academic labs, or well-funded open-source communities. An entire new cohort of groups could meaningfully compete at the cutting edge, driving further innovation.
Conversely, from a policy perspective, regulatory frameworks that revolve around the assumption that only a handful of big labs can develop advanced models might need to be revisited. The gatekeeping effect of “if you can’t muster $100 million, stay away” dissolves if $5 million suffices to produce a frontier model. The impetus to regulate or oversee such technology might intensify, or the technology might simply diffuse more quickly due to its newly accessible cost profile. Philosophical questions about AI safety, existential risk, and the open-sourcing of powerful AI thus swirl around this conversation. DeepSeek-V3 stands at the heart of it, a vivid demonstration that resourceful engineering can leap ahead unexpectedly.
Zooming in just a bit more on the inference side: with around 37B activated parameters, DeepSeek-V3 claims to generate text three times faster than its V2 incarnation—about 60 tokens/second on a standard HPC node decked out with H800 or H100 GPUs. That per-request speed is bound to drop if you scale inference to handle many parallel requests, or if you rely on the CPU-only approach, but it’s a step forward. The authors emphasize that everything in the model is designed for end-to-end throughput: the pipeline, the gating, the multi-token prediction. That synergy is what yields the threefold speedup from one version to the next. It’s not that the model is simply bigger; it’s that the entire scaffolding is oriented around the principle of “maximum efficiency at scale.”
Given these revelations, it’s unsurprising that the official references exude a tone of pride and excitement, with phrases like “China just dropped an open-source AI model that absolutely MOGS, using a FRACTION of the compute US labs burn through,” or “Wake up, America!” swirling in the online discourse. It’s part hype, part reality check. For watchers of global AGI progress, it might be the best demonstration yet that smaller or relatively constrained labs can, with clever design and relentless optimization, achieve extraordinary leaps in capability. DeepSeek’s leadership frames it as a stepping stone: they vow to push further, incorporating multimodal capabilities or further expansions. Perhaps we’ll see a DeepSeek-V4 or V5 that folds in vision and audio tasks, all while maintaining a cost advantage over bigger labs.
So let’s close with a reflection on the ramifications for the AI community at large. For researchers, it’s a reminder that MoE is not “dead” or overshadowed by purely dense approaches; carefully orchestrated gating can scale elegantly. For entrepreneurs and startups, it’s an opportunity—maybe you, too, can coordinate a few million in hardware resources and produce a specialized LLM for your industry. For policymakers, it signals that the locus of advanced AI might be more decentralized than they expected. And for everyday end users? Even if you can’t run a 671B-parameter model on your personal computer, you can enjoy the derivatives, the fine-tuned variants, or the online demos. The broader tide of open-source AI is rising, and DeepSeek-V3 is swiftly cutting a channel for it.
We should not overstate finality: DeepSeek-V3 is no silver bullet for all AI tasks. Rival labs might retort with their own cost-saving breakthroughs. The AI field moves at breakneck speed; what’s cutting edge today can be overshadowed tomorrow. But for the moment, DeepSeek-V3 stands luminous, an emblem of what’s possible when a well-funded yet comparatively smaller lab bets big on a technical proposition: that you don’t need the entire world’s GPU supply if you marshal your experts carefully.
If you crave an up-close look, you can read their technical PDF here or find official announcements and updates on the DeepSeek-V3 GitHub page. To see how it performs in real time, the Hugging Face Space link invites you to experiment with prompts, gleaning your own sense of its capabilities. In the end, the synergy of open weights, massive capacity, cost efficiency, and emergent performance elevates DeepSeek-V3 from yet another large language model to a veritable milestone in AI’s evolving story.
Perhaps the punchline is as simple as this: The once-untouchable realm of multi-hundred-billion parameter LLMs has been pried open, democratized in a way. With a $5.5 million budget and good engineering, you can graze the boundaries of state-of-the-art. That’s both exhilarating and sobering. Exhilarating for all the new participants in the AI race; sobering because it underlines how quickly advanced AI can proliferate worldwide, irrespective of the old hardware bottlenecks. The moral is that neither effectiveness nor accessibility is exclusively the domain of the richest labs anymore. DeepSeek-V3 stands as testament to that—an engineering marvel that redefines what we thought was necessary to hit the apex of LLM performance.
In concluding, let’s remember that cost efficiency and performance synergy will remain crucial watchwords in the field. Just as transistor densities revolutionized computing decades ago, refined mixture-of-experts and advanced pipeline strategies might revolutionize AI training cost structures now. DeepSeek-V3’s successful demonstration of training 671B parameters for $5.5 million could be the first among many such leaps. And while some might say it’s an anomaly, the evidence in their PDF, the code in their GitHub, and the user experiences on Hugging Face all echo a consistent theme: they pulled it off.
Now it’s up to the AI community, from hobbyists to HPC veterans, to integrate these lessons of efficiency, test them in real-world scenarios, and push them even further. That’s the nature of open-source synergy: each breakout achievement spawns a thousand derivatives, improvements, and transformations. One day soon, we might look back on DeepSeek-V3 as the moment we realized that massive AI doesn’t necessarily demand a massive cost—and that, in turn, might reshape the entire future of AI innovation.
––––––––––––––––––––––––––––––
Sources & Links (Embedded Above):
• DeepSeek-V3 GitHub Repository: https://github.com/deepseek-ai/DeepSeek-V3
• Official DeepSeek-V3 PDF: https://github.com/deepseek-ai/DeepSeek-V3/blob/main/DeepSeek_V3.pdf
• Local Run Instructions: https://github.com/deepseek-ai/DeepSeek-V3?tab=readme-ov-file#6-how-to-run-locally
• Interactive Demo on Hugging Face: https://huggingface.co/spaces/akhaliq/anychat
––––––––––––––––––––––––––––––
Thank you for reading this deep dive into DeepSeek-V3. Here’s to a new era of high-efficiency, large-scale AI research—one where “frontier” no longer automatically implies “exorbitantly expensive,” and where the lead in the AI race might shift according to creativity, not just raw dollars.