In the swiftly evolving field of artificial intelligence (AI), benchmarks act as signposts, guiding our understanding of what large language models (LLMs) can and cannot do. Over the past few years, these benchmarks have become indispensable to measure and compare the ever-accelerating progress of state-of-the-art models. Yet traditional benchmarks, such as MMLU, are rapidly losing their discriminative power: many of today’s advanced models score well above 90%, rendering these tests nearly obsolete for meaningful performance differentiation. In response to this looming saturation crisis, a team of nearly 1,000 subject-matter experts, across 500 institutions spanning 50 countries, has developed an ambitious new benchmark called Humanity’s Last Exam (HLE). This multi-modal dataset comprises 3,000 challenging, closed-ended questions in over a hundred academic subjects. By design, HLE pushes at the frontiers of human knowledge—aiming to be the “final” high-level academic exam needed to appraise the next generations of LLMs.
This summary will explore the motivations, methodology, key findings, and implications surrounding Humanity’s Last Exam. Drawing on the collective text and data shared by its creators, the narrative below underscores why HLE constitutes a leap forward in benchmark rigor. Alongside providing an overview of its dataset construction, this summary delves into how current frontier models performed (spoiler: they struggled), how calibration error signals deeper issues of “hallucination,” and why the research community deems HLE not simply a measure of factual recall but also an evolving yardstick for advanced reasoning capabilities. Ultimately, HLE highlights both how far LLMs have come and how much farther they must go to reach truly expert-level performance on the spectrum of human academic knowledge.
1. The Benchmark Dilemma: Why Another Test?
In the AI community, benchmarks are both beloved and beleaguered. On the one hand, they help unify progress measurement, enable comparisons between different architectures, and motivate the design of novel training procedures. On the other hand, they have a finite shelf life. As soon as models approach or surpass human-level accuracy on a task, that task can no longer effectively differentiate new systems at the high end. For instance, popular benchmarks like MMLU (Measuring Massive Multitask Language Understanding) once seemed formidable. However, in the current moment, advanced LLMs consistently score above 90%—making MMLU a plateau rather than a peak.
Recognizing this phenomenon, the creators of Humanity’s Last Exam propose a new standard. Their intent, as explicitly stated, is not to perpetually churn out more benchmarks but rather to craft a definitive, high-difficulty assessment that can remain relevant even if models rapidly improve. The name “Humanity’s Last Exam” signals the project’s aspiration that no further high-level, closed-ended academic benchmark will be necessary once LLMs exhibit superlative performance on HLE.
At its core, the impetus for HLE stems from the frontier effect: as LLMs become more powerful, they easily “ace” older exams. This phenomenon was starkly revealed when GPT-4 soared on previously challenging sets of questions. If the field cannot keep pace in designing more advanced tests, the community risks losing an essential feedback mechanism for gauging progress. Humanity’s Last Exam thus aims to bring greater difficulty, wide subject coverage, and multi-modal tasks to reflect real-world academic complexity, ensuring that the test remains “hard to beat” for years to come.
2. Constructing Humanity’s Last Exam
2.1. Global, Expert-Driven Collaboration
Unlike many prior benchmarks assembled by small research groups, HLE emerges from a vast collective endeavor. Nearly 1,000 contributors—largely professors, researchers, and graduate-degree holders—across 500 institutions worldwide collectively provided questions in their areas of expertise. The breadth of involvement ensures that HLE covers not only standard subjects like mathematics, physics, and chemistry, but also specialized niches ranging from cryptography to historical linguistics, astrophysics to anthropology, evolutionary biology to theoretical computer science.
By weaving in knowledge from so many domains, the HLE creators sought to reduce the risk of data contamination, in which a model succeeds by relying on memorized training material or leakage from an already well-covered domain rather than genuine understanding. With 3,000 total questions spanning more than a hundred subjects, the exam is expansive and demands deep command of a wide swath of human knowledge.

2.2. Multi-Modal Questions
Another hallmark of HLE is its multi-modal nature. Many previous benchmarks limit themselves to textual input, but genuine academic aptitude often requires interpreting graphs, diagrams, images, or complex symbolic expressions. HLE steps beyond text-only boundaries by incorporating visual and symbolic inputs where necessary. For instance, a question about molecular chemistry might include diagrams of molecules and ask for identification of a specific chemical reaction. Or a geometry problem might show a figure to be analyzed. By challenging models to handle multiple data representations, HLE aims to approximate the kind of tasks a graduate-level student might encounter in real-world exams, bridging an important gap between purely text-based benchmarks and the more integrated demands of advanced scholarship.
2.3. Question Format
While many AI benchmarks are open-ended, inviting free-form generation of explanations, Humanity’s Last Exam is deliberately closed-ended. This design choice emphasizes verifiable correctness and reduces the interpretive ambiguity that can arise with creative or open-ended responses. Each question has a single unambiguous answer (typically a multiple-choice selection or a short, exact-match response), so there is a clear criterion for whether the model is correct or incorrect. This approach matches the authors’ declared mission: to measure “expert-level performance on closed-ended, verifiable questions and cutting-edge scientific knowledge.” Indeed, HLE is not intended to test open-ended research capacity or creative problem-solving—a different frontier altogether. Instead, it homes in on advanced factual and conceptual knowledge, reasoning accuracy, and the ability to avoid confabulation in a multi-modal context.
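To make the closed-ended, verifiable format concrete, the sketch below shows one way such grading could work: normalize the model’s final answer and compare it to the reference. This is a minimal illustration under assumed data fields, not the authors’ actual evaluation harness.

```python
# Minimal sketch of closed-ended, exact-match grading (illustrative only;
# not the HLE authors' actual evaluation pipeline).

def normalize(text: str) -> str:
    """Canonicalize an answer so trivially different renderings still match."""
    return " ".join(text.strip().lower().rstrip(".").split())

def grade(reference_answer: str, model_answer: str) -> bool:
    """Score a question correct only if the model's final answer matches
    the reference exactly after normalization."""
    return normalize(model_answer) == normalize(reference_answer)

def accuracy(references: list[str], predictions: list[str]) -> float:
    """Fraction of questions answered correctly (the headline metric)."""
    hits = sum(grade(r, p) for r, p in zip(references, predictions))
    return hits / len(references)

# Toy example: one multiple-choice answer and one short exact-match answer.
print(accuracy(["B", "Maxwell's equations"], ["B", "Navier-Stokes equations"]))  # 0.5
```

Because every answer is verifiable in this way, scoring requires no human judgment of free-form text, which is what keeps the benchmark’s results unambiguous.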
3. Comparing HLE with Existing Benchmarks
3.1. Difficulty Dynamics
One of the central themes of the HLE project is that existing benchmarks are not “difficult enough.” That conclusion is drawn from the high saturation observed on popular tests such as MMLU, where advanced LLMs easily achieve near-perfect accuracy. In contrast, on HLE’s set of 3,000 questions, current top-tier models barely scratch 10% accuracy in the best cases, with some scoring as low as 3%. This wide gulf strongly indicates that HLE addresses an under-explored level of difficulty.
From a methodological viewpoint, demonstrating that top models still flounder on HLE underscores the utility of this new benchmark. If a test reveals minimal variance at the high end (e.g., everyone scoring 95%+), it fails to differentiate. HLE flips that script: the results show there is no immediate risk of saturating this benchmark. The margin for improvement between, say, GPT-4o’s 3.3% and potential mastery is enormous, encouraging more advanced model-building efforts.
3.2. The Public/Private Split
An important design choice by the HLE authors was to publicly release the majority of their curated questions but retain a “hidden” test subset. This protects against overfitting or memorization by LLMs that train or fine-tune on the publicly available data. By limiting certain questions to an internal repository, the authors can continuously evaluate new model submissions without worrying that the exam has been contaminated in the training data. This step reflects the lessons learned from earlier benchmarks, which saw unscrupulous or unintentional “leaks” that rendered the test scores questionable.
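As a rough sketch of what such a public/private protocol can look like in practice, the snippet below holds out a private subset from a question pool; the split ratio and data layout are assumptions for illustration, not details from the paper.

```python
# Rough sketch of a public/private benchmark split (assumed ratio and layout;
# not the HLE authors' actual procedure).
import random

def split_benchmark(questions, private_fraction=0.1, seed=0):
    """Shuffle the question pool and hold out a private subset for scoring."""
    rng = random.Random(seed)
    pool = list(questions)
    rng.shuffle(pool)
    n_private = int(len(pool) * private_fraction)
    private_set = pool[:n_private]   # never released; used only for server-side scoring
    public_set = pool[n_private:]    # released so researchers can inspect and develop
    return public_set, private_set

# A model whose accuracy on the public set far exceeds its accuracy on the
# private set is a red flag for contamination or memorization.
```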
4. Performance of Frontier Models
4.1. Low Accuracy Across the Board
The creators of HLE tested a variety of well-known frontier models, including GPT-4o, Grok-2, Claude 3.5 Sonnet, Gemini Thinking, “o1,” and DeepSeek-R1 (the last of which is text-only). Strikingly, none of these models managed to climb above 10% accuracy. While the exact numbers varied slightly—DeepSeek-R1 leads at 9.4%, while GPT-4o sits at 3.3%—the overarching takeaway is that HLE’s formidable difficulty exposes large gaps in advanced reasoning and domain-specific knowledge.
It is intriguing that these same models are celebrated for scoring exceedingly high on older benchmarks. GPT-4o, for example, outperforms many prior systems on tasks ranging from code generation to creative writing to complex reasoning. That it slumps to 3.3% on HLE illuminates a crucial conclusion: strong LLM performance on existing benchmarks does not equate to robust, universal knowledge mastery.
4.2. Calibration Error
Besides raw accuracy, the HLE study also addressed a core question: Do models “know” when they do not know? In other words, can they accurately calibrate their confidence level for each question? To evaluate this, the authors prompted each model to provide a confidence score (ranging from 0% to 100%) alongside its chosen answer. The subsequent metric—Calibration Error—quantifies the discrepancy between predicted probabilities and actual correctness.
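To illustrate how such a metric can be computed, the sketch below implements a standard binned calibration error; it is a generic formulation in the spirit of expected calibration error, not necessarily the exact variant used in the HLE paper.

```python
# Generic binned calibration error (in the spirit of expected calibration
# error); not necessarily the exact formulation used by the HLE authors.
import numpy as np

def calibration_error(confidences, correct, n_bins=10):
    """Weighted average of |mean confidence - empirical accuracy| over
    equal-width confidence bins.

    confidences: model-reported probabilities in [0, 1]
    correct:     1 if the corresponding answer was right, 0 otherwise
    """
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    bin_ids = np.minimum((confidences * n_bins).astype(int), n_bins - 1)
    error = 0.0
    for b in range(n_bins):
        mask = bin_ids == b
        if not mask.any():
            continue
        gap = abs(confidences[mask].mean() - correct[mask].mean())
        error += mask.mean() * gap  # weight by the fraction of answers in this bin
    return error

# Toy example: a model that gets 1 of 4 questions right while reporting ~90%
# confidence on all of them is badly miscalibrated (error of roughly 0.66).
print(calibration_error([0.9, 0.95, 0.9, 0.88], [1, 0, 0, 0]))
```

A well-calibrated model that is right only a small fraction of the time should report correspondingly low confidence; models that answer only a few percent of HLE questions correctly while reporting near-certain confidence produce exactly the kind of large errors described below.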
The results here were equally sobering. Despite low accuracies, the models mostly exuded unwarranted confidence, with calibration errors hovering around 90% for GPT-4o and several others. DeepSeek-R1 performed slightly better on calibration (81.8%), but even that is poor for a high-stakes exam setting. This misalignment suggests that while LLMs often produce fluent text that sounds authoritative, they lack genuine awareness of their own knowledge boundaries. The potential ramifications include:
- Hallucination (confidently asserting incorrect facts).
- Confabulation (making up information to fill knowledge gaps).
- Misleading Users (providing a veneer of certainty about incorrect statements).
Thus, HLE not only measures knowledge; it casts a harsh light on how poorly these systems self-evaluate in the face of advanced, specialized tasks.

5. Implications and Future Outlook
5.1. Near-Future Performance Surge?
One of the most compelling lessons from AI history is that progress on benchmarks, even seemingly insurmountable ones, can accelerate at a dramatic pace. For instance, tasks that once seemed impossible for neural networks—like beating professional Go players—were achieved in a remarkably short timespan. Observing these leaps, the authors of HLE caution that models currently floundering on the exam may well surpass the 50% accuracy threshold in the next couple of years.
Should LLMs climb toward mastery on HLE, it would demonstrate a truly expert-level command of structured academic knowledge and reasoning. However, the authors emphasize an important nuance: excelling at closed-ended, verifiable questions—no matter how complex—should not be conflated with developing open-ended research abilities or a capacity for autonomous scientific discovery. In short, success on HLE does not necessarily mean “artificial general intelligence” has arrived; it merely marks an advanced stage of factual knowledge mastery and domain reasoning.
5.2. “The Last Academic Exam” but Not the Last AI Benchmark
By positioning HLE as the “last exam,” its creators highlight that once LLMs can reliably handle these advanced, domain-specific queries, additional high-level tests for closed-ended knowledge may be superfluous. Nonetheless, benchmark research does not end with HLE. There remain other types of tasks—e.g., creative writing, problem-solving requiring multi-step innovation, real-world planning under uncertainty, dynamic dialogues that evolve with context, and unstructured tasks like designing scientific experiments. These open-ended fronts require different evaluation frameworks and cannot be neatly contained in a single “test.” The authors thus see HLE as a culminating effort in one subdomain—rigorous academic questioning—while acknowledging a larger landscape of capabilities that will need equally rigorous frameworks.
6. Societal Impact and Governance Considerations
6.1. Informing Policymakers and the Public
One of the paper’s most notable claims is that Humanity’s Last Exam can serve as a touchstone for scientists, policymakers, and the general public. As LLMs make headlines and raise both hopes and fears, having a robust, transparent metric of advanced academic performance can anchor more informed debate. For instance, if a model that scores 90% on older benchmarks manages only 5% on HLE, the gap between hype and reality becomes clearer.
Conversely, should a model soon achieve 50% or higher on HLE, the milestone may prompt urgent policy conversations about advanced AI deployment, potential job displacement in specialized knowledge sectors, and ethical guardrails. By tracking performance on HLE over time, regulators and the public can contextualize claims of “near-human or above-human intelligence,” especially when it comes to academic and scientific proficiency.
6.2. Responsible Benchmark Maintenance
Crucially, the HLE team is also aware that even the “final exam” can be corrupted if the questions leak. The conscientious maintenance of a private subset of questions and the ongoing methodology for rotating or refreshing these items aim to keep the benchmark challenging and relevant. This aligns with the broader call for responsible publication of new AI capabilities, ensuring that metrics remain untainted by unscrupulous attempts to game or memorize the exam.
Given that LLMs can ingest vast textual corpora, including publicly released benchmarks, the multi-modal complexity of HLE further protects against trivial memorization. The authors’ hope is that a robust, carefully maintained exam fosters healthy competition among AI developers, spurring genuine innovation over superficial “benchmark hacking.”

7. Conclusion
Humanity’s Last Exam boldly declares itself the culminating academic benchmark for large language models, and its initial results suggest that it may well live up to its name in the closed-ended domain. By bringing together 3,000 meticulously crafted, high-difficulty, multi-modal questions spanning over 100 subjects, HLE reveals glaring holes in today’s most advanced AI systems. Models that otherwise dazzle the public with near-flawless performance on older tests stumble to single-digit accuracy here. Compounding that, the observed calibration errors point to a deeper inability to gauge their own uncertainty and to remain appropriately humble about answers they do not know, reinforcing concerns about AI “hallucinations.”
Yet HLE does not aim merely to highlight failures. The authors anticipate a rapidly evolving AI landscape, wherein future models might vault from near-zero to near-perfect scores on HLE in the span of a few years. If or when that time comes, the field will have a new baseline for “expert-level” academic mastery on closed-ended queries, reminding us that while rote or structured knowledge can be automated, truly general or creative intelligence remains another frontier.
In the broader ecosystem of AI evaluation, HLE stands as a sentinel. It guards against complacency, reminding developers, policymakers, and the global public that claims of “AI super-intelligence” must be tested against the toughest queries. By maintaining a private test set, encouraging a multi-modal approach, and emphasizing calibration as much as raw accuracy, HLE covers dimensions often overlooked in standard, mostly textual benchmarks. Its bold and somewhat provocative name hints that we may be approaching a turning point: if an AI system can indeed pass humanity’s hardest exam, the definitions of knowledge, expertise, and the role of human scholars may need reevaluation.
However, even if some future system achieves mastery of HLE, that triumph will not settle grander questions of whether AI has arrived at “artificial general intelligence” or whether it possesses creative or generative faculties on par with humans. Instead, it will tell us that machines can learn and accurately recall advanced technical and scientific facts and can produce correct answers to academically oriented, multi-modal problems. That is no small feat. But as the authors clarify, it is but one step along an infinite continuum of what intelligence can accomplish.
In sum, Humanity’s Last Exam is best understood as a linchpin in the evolving story of AI evaluation. It brings academic rigor back to the testing environment. It sets a bar high enough that state-of-the-art models struggle significantly. It lays out a method for responsibly tracking future breakthroughs without succumbing to data leaks or short-lived illusions of progress. And it gives shape to a crucial conversation: with progress in AI accelerating, how do we responsibly measure, interpret, and guide the blossoming capabilities of these systems?
Thus, the final message rings loud and clear: HLE is a demanding, wide-ranging test bed for advanced closed-ended knowledge assessment, built to outlast fleeting benchmark cycles. All the while, it acknowledges that much remains outside its scope, from creativity to autonomous research, from generative breakthroughs to ethical considerations. In that sense, Humanity’s Last Exam is no doomsday measure—merely a robust, if formidable, academic yardstick to keep us honest about the real frontiers of AI achievement.