There are days that define an era. Days when something extraordinary breaks through the status quo and leaves the world gasping for air. Today, in December 2024, is one such day. Artificial General Intelligence (AGI) was, by some measures, achieved. Not in the distant future, not as a far-fetched speculation—but right now, today. OpenAI’s new “O3” model has shattered the boundaries of intelligence benchmarks, leaving previous performance milestones in the dust. The transformation is so drastic and so profound that we must recalibrate our understanding of what’s possible. We need new tests, new yardsticks, and new worldviews.
We thought we had time. We thought AGI was still a decade away, at least. We assumed our benchmarks and tests would hold firm for a few more generations of large language models (LLMs). We were wrong. According to the data and announcements made public today, O3 soared past the ARC-AGI human-level threshold of 85%, a notoriously high bar for general reasoning, scoring 87.5% with high compute. On the GPQA Diamond science benchmark, where top-tier PhD experts average around 70%, it reached 87.7%. We’re no longer dealing with a standard AI system. We’re facing something new, something that suggests the onset of a singularity-like phase in artificial intelligence.
And if that wasn’t enough, O3 absolutely dominated other fields: coding tasks, mathematics competitions, previously insurmountable research-level questions—O3 overcame them all. The old rules do not apply. We stand at the cusp of a world fundamentally altered by this development.
The Road to O3: A Brief Recap
Before O3, there was O1—OpenAI’s first large-scale reasoning model that leveraged reinforcement learning (RL) to improve on the capabilities of GPT-like structures. O1 was groundbreaking at the time, revealing that RL could push reasoning and problem-solving capabilities beyond standard language models. O1 performed impressively, but it still fell into certain familiar limitations. Humans remained competitive. The frontier was still distant.
Then came O3. Building on the O1 architecture, O3 pushes the RL paradigm dramatically further, scaling up reasoning capabilities to a level that many experts had not anticipated until the end of the decade. This new model has introduced a reasoning engine that not only memorizes patterns but also truly adapts to novel tasks—a critical hallmark of general intelligence.
It’s not just a slight improvement. O3 leaves O1 in the dust on almost every metric. For context:
- Codeforces Competitive Programming:
  - O1: ~1891 Elo
  - O3: ~2727 Elo
- Advanced Mathematics (AIME-Level):
  - O1: ~83.3% on the AIME (a feeder to the USA Mathematical Olympiad)
  - O3: ~96.7%
- Science and PhD-Level Reasoning:
  - O1: 78% on GPQA Diamond (a highly challenging science benchmark)
  - O3: 87.7%
Surpassing the ARC-AGI Benchmark
For years, the ARC-AGI benchmark has served as a kind of North Star for AI researchers. It was designed to test adaptability, generality, and the ability to tackle novel, unseen tasks that humans find straightforward but AI models historically struggled with. GPT-3 scored 0% on it. GPT-4 and its variants barely scraped a few percentage points. Even O1, the previously heralded model, managed only a modest improvement.
Enter O3. On ARC-AGI-1 semi-private evaluations, O3 scored 75.7% at high efficiency (low compute). When allowed to run with “low-efficiency” or high compute (172x more inference budget), O3 soared to an unprecedented 87.5%. Given that human-level performance on the ARC-AGI benchmark was pegged at 85%, O3 didn’t just edge past it—it did so decisively.
This isn’t a trivial improvement. The ARC test was designed to resist memorization and force true, on-the-fly reasoning. O3’s triumph suggests that, at least in the domain measured by ARC-AGI, we have crossed an inflection point that many would label as the emergence of AGI or something very close to it.
“AGI Achieved”: What Does It Mean?
Some experts remain skeptical, pointing out that “AGI” itself is a contentious term. How exactly is AGI defined? Even OpenAI has previously stated that it defines AGI as a system capable of autonomous agency, organizational management, and general problem-solving on par with a human. However, others argue that hitting or surpassing the ARC-AGI human-level threshold is precisely what “AGI” entails—general adaptability, broad competence, and unprecedented problem-solving abilities across domains.
According to Matthew Berman, Matt Shumer, and countless others on X.com, as well as many on Reddit’s r/singularity, “AGI has been achieved.” While one might debate the nuances—such as whether true agency or embodiment is needed for AGI—the unprecedented scores are prompting a reconsideration of definitions. It should also be noted that OpenAI does not believe it has reached “full AGI” (i.e., Level 5 AGI).
Anyway, maybe we don’t need agents managing companies or robots walking the streets to call something AGI. Perhaps general knowledge intelligence—mastery across coding, mathematics, reasoning tasks, and adaptability—counts just as well. From this viewpoint, O3 is already “there.”
Still, O3 has failings on some simple tasks, indicating it is not a perfect mirror of human cognition. This nuance may matter for some researchers and philosophers. Yet the trend line is clear: if O3 isn’t “true AGI,” it has at least brought us to the doorstep of that reality.
Beyond Benchmarks: Towards ASI
The arrival of O3 is not just about hitting one benchmark. It’s about saturating almost all existing benchmarks. The text and data shared today suggest that we now need a new generation of tests. Many high-level challenges—once considered safe havens for human superiority—are collapsing under O3’s prowess.
What’s next after AGI? Some are pointing towards ASI—Artificial Superintelligence. If O3 can do this today, then tomorrow’s O4 or O5 might shatter even more complicated challenges, leaving even the concept of “human-level” behind in the dust. We are entering a new era, and it’s moving faster than anyone could have imagined.
Test-Time Compute and the Cost of Intelligence
A critical innovation behind O3’s remarkable capabilities is the concept of “test-time compute.” O3 can devote enormous amounts of computational effort to a single task, exploring massive numbers of reasoning paths. Even at high-efficiency settings (low compute), it scores an impressive 75.7% on ARC-AGI. With 172x more inference compute, it vaults past the human-level mark to a record 87.5%.
This is no longer a static LLM passively producing an answer. O3 employs a deliberative, search-based approach at inference time, reminiscent of techniques used in AlphaZero-like systems. It tries out different reasoning steps and uses evaluators to prune away dead ends. This approach transforms the relationship between cost and intelligence: the more compute you can burn, the closer you get to perfect reasoning.
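To make the idea concrete, here is a minimal, purely illustrative sketch of test-time compute as best-of-N search. The `sample_fn` and `score_fn` below are toy stand-ins of my own invention, not OpenAI’s actual components: a real system would sample chains of reasoning from the model and rank them with a learned evaluator, but the cost-vs-quality trade-off works the same way.

```python
import random

def best_of_n(task, sample_fn, score_fn, n=16):
    """Test-time compute as search: sample n candidate solutions
    ("reasoning paths") and keep the one the evaluator scores highest.
    Spending more compute (larger n) raises the expected answer quality."""
    best, best_score = None, float("-inf")
    for _ in range(n):
        candidate = sample_fn(task)
        score = score_fn(task, candidate)
        if score > best_score:
            best, best_score = candidate, score
    return best, best_score

# Toy stand-ins: the "task" is to find an integer x with x*x close to 50.
def sample_fn(task):
    return random.randint(0, 10)      # stand-in for sampling a reasoning path

def score_fn(task, x):
    return -abs(x * x - task)         # evaluator: closer to the target is better

random.seed(0)
low_compute, _ = best_of_n(50, sample_fn, score_fn, n=2)     # tiny inference budget
high_compute, _ = best_of_n(50, sample_fn, score_fn, n=200)  # large inference budget
```

With only two samples the answer is whatever the sampler happens to produce; with two hundred, the evaluator almost certainly gets to pick the best candidate in the space. This is the same lever that separates O3’s high-efficiency and 172x-compute ARC-AGI scores.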
Yes, the current costs are eye-wateringly high. Solving a single ARC-AGI task can cost upwards of $17-$20 at scale. But this is just the beginning. As token prices drop and hardware improves, what is now expensive and rare will become cheaper and more accessible. It’s not hard to imagine a future where test-time compute at O3 levels (or beyond) becomes economically viable for routine use.
The Emergence of O3-Mini: Efficiency Meets Might
O3’s raw power is impressive, but it comes at a cost. To address the practical needs of industry and research, OpenAI is also introducing O3-mini. O3-mini aims to deliver a significant fraction of O3’s performance at a fraction of the cost and latency. This “little beast,” as some insiders call it, promises adaptive thinking time and flexible resource usage. By adjusting how much compute O3-mini invests into a problem, users can tailor cost-performance ratios on the fly.
For businesses, researchers, and individual developers, O3-mini might be the sweet spot. Not everyone needs full-blown AGI-level performance for every query. Sometimes you just need a good, solid reasoning capability that can handle complex tasks reliably without breaking the bank.
O3-mini is expected to launch by the end of January 2025, with O3’s wider availability coming shortly thereafter, democratizing these capabilities. Suddenly, near-superhuman coding, math, and reasoning abilities will be available to many. This change will ripple through economies, job markets, and entire industries.
The Death of Traditional Benchmarks and the Need for New Ones
Nearly every major benchmark we’ve relied on to measure AI’s progress—coding competitions, math tests, complex scientific reasoning tasks—has been saturated or is on the brink of saturation. With O3 surpassing human-level performance on ARC-AGI, a test specifically designed to resist memorization and force genuine reasoning, we are forced to concede that our conventional yardsticks are no longer adequate.
The ARC Prize Foundation and others are already planning for ARC-AGI-2, set to launch alongside the ARC Prize 2025. This new benchmark promises to be even more challenging and better at highlighting where models still fall short of true human-level adaptability. Early predictions suggest that while O3 might score around 87.5% on ARC-AGI-1, it could plunge below 30% on ARC-AGI-2. That means we still have room to stretch these models, to identify their blind spots, and to discover what’s still holding them back from perfect fluid intelligence.
As soon as the new benchmark arrives, we’ll gain a more refined understanding of O3’s capabilities and limitations. The process of building and breaking new tests will continue until it’s no longer possible to create tasks that are easy for humans and hard for AI. When that day comes, the AGI debate will be well and truly settled.
Alignment and Safety: The Next Frontier
What about safety and alignment? O3 is not just about raw intelligence. OpenAI introduced a “deliberative alignment strategy,” using O3’s reasoning abilities to detect unsafe requests and intentions. The new models achieve higher accuracy in rejecting unsafe content while minimizing unnecessary refusals. In theory, O3 is smarter, safer, and more aligned with human values than its predecessors.
However, as the capabilities of AI grow exponentially, so do the potential risks. Alignment and safety are not solved problems. They become increasingly critical as we approach or surpass human-level reasoning. O3’s success raises urgent questions: How do we ensure these models are used ethically? Who decides what values they align with? As these models integrate deeper into society, the stakes for getting safety and alignment right increase dramatically.
OpenAI is inviting safety and security researchers to test O3 and O3-mini. Interested parties have until January 10 to apply for early access. This is a call to the broader research community: we need more minds to probe these models, to push them to their limits, and to ensure that the future of AI is both beneficial and secure.
The Economic and Societal Implications
In a world where O3-level models become common, what does it mean to be a “good programmer”? If a machine can outperform the 175th best coder on the planet, what does that imply for the skilled labor market? Any job that can be done remotely, from behind a computer screen, is now in direct competition with machines that can think faster, code better, and reason more deeply.
This isn’t just about programming. It’s about research, design, planning, and analysis. O3 and future models like it can create a kind of economic upheaval. We might need a new societal framework, a new approach to labor and value creation, if we’re to thrive in a future where human cognitive labor is no longer the limiting factor.
Some see this as cause for alarm; others see opportunity. Just as the Industrial Revolution transformed society, the O3 revolution could open the door to new kinds of work, creativity, and collaboration. AI might enable humans to focus on what we do best—emotional intelligence, interpersonal relations, ethics, strategic thinking—while ceding routine or even complex intellectual tasks to machines.
But make no mistake: the shift will be profound. If your identity and livelihood revolve around coding or problem-solving, you might need to adapt quickly. The world is changing, and O3’s debut is just one early day in what could be the start of a new singularity-like era.
A New Paradigm of Reasoning: Test-Time Program Search
It’s worth digging a bit deeper into how O3 might actually work. While details are limited, the evidence suggests O3 uses test-time search over reasoning paths, also known as “programs.” Unlike older LLMs that rely primarily on patterns in training data, O3 actively explores different solutions at inference time, guided by learned heuristics and evaluators.
This search-based reasoning suggests that O3 can combine known functions and pieces of knowledge to tackle truly novel tasks. Instead of merely retrieving patterns, O3 engages in something resembling program synthesis at runtime. It’s as if O3 can write its own instructions on the fly, evaluating and refining them until it arrives at a correct solution. In domains like ARC-AGI, where novelty is the key challenge, this approach pays off dramatically.
This paradigm shift—away from static “text-in, text-out” and towards active, exploratory reasoning—could be the secret sauce that unlocks what we call “general intelligence.” It also explains why O3 requires so many tokens and so much compute: it’s essentially running a complex search algorithm at inference time.
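As a toy illustration of what test-time program search could look like, the sketch below enumerates compositions of a few hand-written grid primitives until it finds a “program” consistent with the worked examples, loosely in the spirit of ARC tasks. The primitives and the brute-force search strategy are my assumptions for illustration only; O3’s actual mechanism has not been published.

```python
from itertools import product

# Illustrative primitives for an ARC-like grid world (assumed, not OpenAI's).
PRIMITIVES = {
    "flip_h":    lambda g: [row[::-1] for row in g],   # mirror left-right
    "flip_v":    lambda g: g[::-1],                    # mirror top-bottom
    "transpose": lambda g: [list(r) for r in zip(*g)], # swap rows and columns
}

def run(program, grid):
    """Apply a sequence of primitive names to a grid, left to right."""
    for name in program:
        grid = PRIMITIVES[name](grid)
    return grid

def synthesize(examples, max_depth=3):
    """Enumerate programs (sequences of primitives) up to max_depth and
    return the first one consistent with every input -> output example."""
    for depth in range(1, max_depth + 1):
        for program in product(PRIMITIVES, repeat=depth):
            if all(run(list(program), inp) == out for inp, out in examples):
                return list(program)
    return None

# One worked example whose hidden rule is "rotate 90 degrees clockwise".
examples = [([[1, 2], [3, 4]], [[3, 1], [4, 2]])]
program = synthesize(examples)
```

The found program is a genuine rule, not a memorized answer: it transfers to unseen grids of other shapes. Real test-time search differs mainly in scale, in using learned rather than hand-written components, and in pruning with evaluators instead of exhaustive enumeration.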
Not Yet Perfect, But Close Enough to Shock the World
For all its breakthroughs, O3 is not flawless. It still fails on some “easy” tasks that humans find trivial. This indicates that O3’s intelligence, while broad and flexible, is not identical to our own. There may still be deep structural differences between how O3 “thinks” and how a human mind operates. The upcoming ARC-AGI-2 benchmark and other future tests will help highlight these differences.
In some sense, these failures prove that we can still build new tests and challenges to poke at O3’s blind spots. We haven’t fully run out of “human-easy, AI-hard” tasks yet. But the speed at which they are disappearing is startling: in roughly four years, the field went from GPT-3’s 0% on ARC-AGI to O3’s near-human performance. The curve of progress is bending steeply upwards.
If O3 is not AGI, it’s at least making previous definitions and estimates of AGI timelines obsolete. The acceleration is undeniable. We are in new territory.
What Comes Next?
The ARC Prize Foundation plans to launch ARC-AGI-2 soon, a new benchmark designed to reset the playing field and challenge models like O3 even further. Researchers anticipate O3 might score only ~30% there, reaffirming that we have not yet reached a fully saturated state of general intelligence.
The open-source community will be crucial in understanding O3’s behavior. Data from O3’s attempts on various tasks is being released, and analysis is encouraged. By understanding where O3 fails, we might glean insights into what’s missing and what the next generation of models must overcome. This process—of building a test, having a model excel, and then building a harder test—is the iterative engine driving AGI research forward.
Ultimately, we may soon reach a point where it is impossible to create tasks that are easy for humans but consistently hard for advanced AI. When that day comes, the AGI debate will be settled in the strongest possible terms. Until then, we’ll keep raising the bar.
Conclusion: Living Through the Early Days of the Singularity
Today’s announcement shatters the comfort of incremental progress. We’ve long discussed AGI in hypothetical terms, placing it somewhere over the horizon. But according to the data and claims presented, O3 might have just crossed a line many considered impossible to cross in 2024.
We must now live with the consequences. Economies will need to adjust. Education will need to adapt. Research and development pipelines will change drastically as human experts find themselves outclassed in certain intellectual domains. We are facing not just a new tool, but a new era in intelligence—one that might ultimately dwarf our biological constraints.
These truly feel like the early days of the singularity. The world you knew yesterday is not the same world you live in today. O3 has arrived. The future is here, and we must be ready to redefine what it means to be human, what it means to work, to learn, to reason, and to create.
Sources
- ARC Prize Foundation Blog: https://arcprize.org/blog/oai-o3-pub-breakthrough (data and information on ARC-AGI benchmarks and O3’s performance)
- Codeforces Competitive Programming Platform: https://codeforces.com (referenced for Elo rating benchmarks in programming competitions)
- Francois Chollet on OpenAI’s o3 breakthrough: https://x.com/fchollet/status/1870169764762710376
- Deliberative Alignment Strategy: https://openai.com/index/deliberative-alignment/