Large language models don’t “see” the world. They model it—statistically, hungrily, and at scale. So when they produce confident falsehoods—hallucinations—it can feel like a betrayal: an articulate guess packaged as truth. The latest analysis from Adam Tauman Kalai (OpenAI), Ofir Nachum (OpenAI), Santosh S. Vempala (Georgia Tech), and Edwin Zhang (OpenAI) cuts through the mystique. Their claim is sober and sharp: hallucinations are not mysterious side effects. They’re statistical inevitabilities of how we pretrain models and, just as importantly, how we evaluate them afterward.
The paper argues two main points:
- Pretraining inherently induces errors (including hallucinations), even on error-free corpora, because generation is statistically harder than classifying validity.
- Post-training pipelines and benchmarks overwhelmingly reward guessing over abstaining, turning models into relentless test-takers that bluff—because the grading rules say bluffing wins.
This article summarizes the paper’s core arguments, clarifies the math without drowning in it, and surfaces the socio-technical fix the authors propose: change how we score mainstream benchmarks so that abstention isn’t punished and calibrated uncertainty isn’t sacrificed at the altar of accuracy.
Along the way, we’ll reference the exact resources the authors cite, like the GPT-4 technical report’s calibration findings (OpenAI, 2023a), prominent benchmarks such as GPQA (Rein et al., 2024), MMLU-Pro (Wang et al., 2024), WildBench (Lin et al., 2025), and SWE-bench (Jimenez et al., 2024)—not to lionize benchmarks, but to expose how their binary scoring quietly entrenches hallucination.
A crisp problem statement
Models still hallucinate—now, today, in their latest incarnations (OpenAI, 2025a). Ask for a fact that’s rare or absent in their training data—say, a specific person’s birthday in a strict format—and models will confidently fabricate. Ask “How many Ds are in DEEPSEEK?” and you’ll often get wrong counts.
These aren’t accidental flourishes; they reflect deep statistical pressures. And even when post-training nudges models toward harmlessness or instruction-following (e.g., RLHF; Ouyang et al., 2022), we still observe stubbornly overconfident behaviors—because the tests they’re optimized for reward them.
The paper’s core thesis splits cleanly in two:
- Pretraining produces a base model that, by its very objectives, will generate errors—including plausible falsehoods—without any malicious intent or architecture-specific quirk.
- Post-training does not reliably disinfect these errors, because the dominant evaluation regimes penalize abstention (“I don’t know”) more than wrong but confident answers. That incentivizes bluffing.
Let’s unpack both.
Part I: Pretraining makes errors mathematically inevitable
The paper builds a surprisingly elegant bridge: it reduces generative errors to a supervised learning problem called IIV—Is-It-Valid. Here’s the idea. Imagine taking a model’s potential outputs and labeling them “valid” (+) or “error” (–). If you could classify validity perfectly, you’d never generate errors: you’d only sample from the valid set. But generation is harder than classification: to generate, the model implicitly answers “Is this valid?” for every plausible candidate, not just for a single choice.
Formally, let the model be a probability distribution over plausible outputs X partitioned into V (valid) and E (errors). Let p be the (clean) training distribution; let p̂ be the pretrained model. The paper defines an IIV classifier based on thresholding the model’s probabilities. Then it relates the model’s generative error rate to the IIV misclassification rate. In its simplest (no-prompt) form, the relationship centers on this intuition:
- If a classifier built from the generative model mislabels items at some rate, then the generative model will spill probability mass into the error set at roughly double that rate during generation (up to the correction terms in the theorem below).
The paper phrases it precisely and, with prompts included, states a main theorem (Theorem 1) yielding a lower bound on generative errors in terms of IIV misclassification plus calibration terms and the relative sizes of valid/error response sets per context. The high-level takeaway: if valid and invalid responses are hard to separate statistically, then sampling-based generators will make mistakes—inevitably.
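To make the reduction concrete, here is a minimal Python sketch, assuming a toy finite output space, a model given as an explicit probability table, and an illustrative threshold; the names, numbers, and threshold choice are invented for illustration, not taken from the paper.

```python
# Toy illustration of the Is-It-Valid (IIV) idea: build a classifier by
# thresholding the model's probabilities, and compare its labels with the
# probability mass the same model spills into the error set when sampling.
# The setup, names, and threshold choice are illustrative assumptions.

def iiv_label(p_hat, x, threshold):
    """Call x 'valid' iff the model assigns it probability above the threshold."""
    return "+" if p_hat.get(x, 0.0) > threshold else "-"

def generative_error(p_hat, error_set):
    """Probability mass the generator places on erroneous outputs."""
    return sum(prob for x, prob in p_hat.items() if x in error_set)

# One valid answer, several plausible wrong ones.
valid = {"paris"}
errors = {"lyon", "berlin", "rome", "madrid"}
p_hat = {"paris": 0.55, "lyon": 0.20, "berlin": 0.15, "rome": 0.07, "madrid": 0.03}

threshold = 1.0 / len(errors)  # an illustrative threshold, not the paper's exact choice
labels = {x: iiv_label(p_hat, x, threshold) for x in valid | errors}
print(labels)                           # which strings the thresholded classifier accepts
print(generative_error(p_hat, errors))  # 0.45 of the mass becomes generation errors
```

If the thresholded classifier can’t cleanly separate “paris” from the plausible wrong capitals, the sampler has no way to avoid them either; that is the content of the reduction.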
The punchline is not a quirk of transformers, decoding, or next-token prediction. It’s broader: fitting a distribution (density estimation) to language where valid and invalid are interleaved implies error. Even fantasy-free training data won’t save you. Why? Because many “facts” have no learnable pattern at the level the model sees them—think birthdays, minor biographies, ephemeral or idiosyncratic details.
The result dovetails with and generalizes earlier analysis of arbitrary facts and missing mass (Kalai & Vempala, 2024). In that work, the fraction of facts appearing exactly once in training lower-bounds the hallucination rate at inference. This paper absorbs that intuition via the IIV reduction and strengthens it to include prompts and explicit abstention.

Calibration enters the chat
One of the most interesting threads involves calibration. The bound depends on a small delta term (δ) that reflects a form of miscalibration: the mismatch between how often the model says, “this is above the threshold” and how often that’s true under the real distribution. The paper argues that cross-entropy pretraining tends to shrink δ. In fact, even a trivial “uniform over X” distribution yields δ = 0 at a single threshold, so δ being small isn’t exotic.
Empirically, pretrained base models are often well-calibrated, whereas post-training can make calibration worse—see the GPT-4 calibration histograms before vs. after reinforcement learning (OpenAI, 2023a, Figure 8). Pretraining encourages calibration; calibrated models, per the paper’s math, must assign non-negligible mass to error regions when the boundary is statistically hard to discern. In short: calibrated base models must make some errors.
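As a rough illustration of that δ term, here is a hedged sketch, assuming a toy output space and a model given as an explicit probability table (both made up for illustration): it compares the mass the model places above a threshold with the mass the true distribution places on those same outputs.

```python
# Hedged sketch of a single-threshold calibration gap, following the article's
# informal description. All distributions and names here are illustrative.

def delta_at_threshold(p_hat, p_true, threshold):
    """|model mass above the threshold - true mass on those same outputs|."""
    above = {x for x, q in p_hat.items() if q > threshold}
    model_mass = sum(p_hat[x] for x in above)
    true_mass = sum(p_true.get(x, 0.0) for x in above)
    return abs(model_mass - true_mass)

# The trivial case mentioned above: a uniform model puts nothing strictly above
# a threshold equal to its own probability value, so the gap is zero.
outputs = ["a", "b", "c", "d"]
uniform = {x: 0.25 for x in outputs}
p_true = {"a": 0.7, "b": 0.3}
print(delta_at_threshold(uniform, p_true, threshold=0.25))  # 0.0
```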
That reframes a popular assertion: “Hallucinations are inevitable.” The authors refine it—inevitable for base models whose objective is density estimation and whose calibration is good. One could make a model that never hallucinates by refusing to answer, or by acting as a tiny closed-book Q&A oracle with IDK as the default. But that would fail at density estimation and be useless at broad language generation. Completeness vs. correctness; breadth vs. consistency—trade-offs theorized elsewhere (Kleinberg & Mullainathan, 2024; Kalavasis et al., 2025).
With prompts: the general reduction (Theorem 1)
Real models answer in context, so the paper extends the reduction to prompts c with response sets R_c partitioned into valid V_c and erroneous E_c. The error bound becomes

$$\text{err} \;\ge\; 2 \cdot \text{err}_{\mathrm{IIV}} \;-\; \frac{\max_c |\mathcal{V}_c|}{\min_c |\mathcal{E}_c|} \;-\; \delta,$$
where δ again measures a calibration gap at a threshold set by the smallest error set across contexts. Intuitively, if there are many ways to be wrong (Ec large) and few to be right (Vc small), then even small misclassification rates force non-trivial generative error.
In cases like factoid questions with a single correct answer plus IDK, Vc is tiny compared to Ec (e.g., birthdays with fixed formatting: 364 wrong days vs. 1 right answer plus IDK). If the model can’t perfectly tell valid from invalid, it must allocate some probability to errors when sampling. The math says: there’s nowhere else for that probability mass to go.
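To see how the terms trade off, here is a small numeric sketch with hypothetical values plugged into the bound above (the numbers are illustrative, not from the paper):

```python
# Hypothetical numbers for a birthday-style prompt family, plugged into
# err >= 2*err_IIV - max_c|V_c| / min_c|E_c| - delta.

err_iiv = 0.20      # suppose validity is misclassified on 20% of cases
max_valid = 1       # at most one correct date per prompt
min_error = 364     # many plausible wrong dates in a fixed format
delta = 0.01        # a small calibration gap

lower_bound = 2 * err_iiv - max_valid / min_error - delta
print(f"generative error >= {lower_bound:.3f}")  # ~0.387
```

The ratio term barely dents the bound when the error set is large, which is exactly the regime the birthday example describes.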
Arbitrary facts and the singleton rate (Theorem 2)
Here’s where the Good–Turing “missing mass” intuition lights up the theory. Suppose for many prompts the correct answer is essentially arbitrary from the model’s perspective (arbitrary facts), and sometimes the right response is IDK. Define the singleton rate sr as the fraction of prompts that appear exactly once in the training data with a non-IDK answer. Then, with high probability over training draws, any algorithm that outputs a calibrated model must have generative error at least roughly sr (up to small terms and scaling by the number of incorrect choices per prompt). More precisely, the paper shows a high-probability lower bound of the form

$$\text{err} \;\ge\; sr \;-\; \frac{2}{\min_c |\mathcal{E}_c|} \;-\; O\!\left(\frac{\ln N}{\sqrt{N}}\right) \;-\; \delta.$$
Interpretation: if a large fraction of facts appear only once, and there are many more wrong responses than right ones per prompt, then hallucinations at or above sr are unavoidable for a calibrated base model. This formalizes why rare facts are fragile: you can’t generalize to millions of specifics that were barely seen, and you can’t memorize them all either.
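Here is a hedged sketch of what the singleton rate measures, assuming a toy training set of (prompt, answer) pairs and the article’s definition (prompts appearing exactly once with a non-IDK answer); the data and names are invented for illustration.

```python
from collections import Counter

def singleton_rate(training_pairs, idk_token="IDK"):
    """Fraction of training examples whose prompt appears exactly once with a non-IDK answer."""
    prompt_counts = Counter(prompt for prompt, _ in training_pairs)
    singletons = [
        (prompt, answer)
        for prompt, answer in training_pairs
        if prompt_counts[prompt] == 1 and answer != idk_token
    ]
    return len(singletons) / len(training_pairs)

pairs = [
    ("birthday of A", "03-07"),
    ("birthday of B", "21-11"),
    ("capital of France", "Paris"),
    ("capital of France", "Paris"),  # repeated fact, not a singleton
    ("birthday of C", "IDK"),
]
print(singleton_rate(pairs))  # 0.4: two of five examples are one-off, non-IDK facts
```

By the theorem, a calibrated base model trained on data like this should be expected to hallucinate on roughly that fraction of such prompts, minus the small correction terms.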
Poor models, hard concepts (Theorem 3 and Corollary 2)
Not all errors are arbitrary-fact errors. Some are induced by representational limits or misfitting. If your family of classifiers can’t represent the boundary between valid and invalid well, you’ll misclassify—a standard agnostic learning insight (Kearns et al., 1994). The paper uses a multiple-choice setup (one correct answer per context) to show that if the best thresholded classifier built from a model family has misclassification opt(𝒢), then the generative error must be large—specifically

$$\text{err} \;\ge\; 2\left(1-\frac{1}{C}\right) \cdot \text{opt}(\mathcal{G}),$$
where C is the number of choices. Concrete example: classic trigram language models can’t disambiguate certain pronoun–noun dependencies. Under a simple construction, any trigram model must err at least 1/2 of the time (Corollary 2). This helps explain intrinsic failures like letter counting in tokenized prompts: when the model class encodes tokens (e.g., D/EEP/SEE/K) instead of character-level structure, “count the Ds” can become awkward—until you switch to a model (e.g., a reasoning-optimized variant) that actually computes the count.
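As a quick numeric illustration of that bound, with hypothetical values:

```python
# err >= 2*(1 - 1/C) * opt(G), with made-up numbers.
C = 4          # answer options per prompt
opt_g = 0.30   # best IIV misclassification achievable within the model family
print(2 * (1 - 1 / C) * opt_g)  # 0.45: generative error can't fall below this
```

Under the theorem’s assumptions, even the best-trained model from such a family errs on at least 45% of generations.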
More contributors: distribution shift, GIGO, computational hardness
- Distribution shift: prompts at test time often diverge from training distributions (Quiñonero-Candela et al., 2009; Moreno-Torres et al., 2012). Trick riddles like “Which is heavier, a pound of feathers or a pound of lead?” can trigger incorrect associations in certain models.
- GIGO (garbage-in, garbage-out): pretrained corpora contain errors; base models pick them up and propagate them (Lin et al., 2022b; Levy et al., 2021; Alber et al., 2025).
- Computational hardness: some prompts hide cryptographic or complexity-theoretic impossibilities. The paper shows a stylized reduction where answering “decrypt this” correctly without a key would mean breaking a secure cryptosystem; calibrated models must fail in those regimes (see the formalization referencing Goldreich, 2001).
All of these feed the same conclusion: pretraining—even on clean data—induces errors for deep, principled reasons. And that’s only half the story.

Part II: Post-training pipelines keep hallucinations alive—because our tests do
So why don’t RLHF, DPO, and friends finish the job? Plenty of work shows post-training can reduce common misconceptions, conspiracy uptake, and various falsehood patterns (Ouyang et al., 2022; Bai et al., 2022; Rafailov et al., 2023; Tian et al., 2024). Yet the paper’s analogy nails it: language models are trained to be good test-takers. And most tests penalize blank answers as harshly as wrong answers. That’s a recipe for guessing.
Try on the student analogy: on a 0–1 graded exam, a rational student should guess when unsure. Models are rational test-takers too. If benchmarks track accuracy or pass rate with no credit for abstaining, your model does better by bluffing—often with specificity—than by saying “I don’t know.”
The authors inspect influential benchmarks and leaderboards. The verdict: binary grading dominates. Benchmarks like GPQA (Rein et al., 2024), MMLU-Pro (Wang et al., 2024), BBH (Suzgun et al., 2023), Omni-MATH (Gao et al., 2024a), MuSR (Sprague et al., 2024), SWE-bench (Jimenez et al., 2024), and others primarily grant zero for abstentions. Even evaluation setups that use LM judges may occasionally count confident but incorrect long-form answers as “correct,” compounding the incentive to bluff. WildBench (Lin et al., 2025) offers a more nuanced 10-point scale, but even there, an IDK response might score worse than a “fair” but partially incorrect answer—meaning that, on average, a strategic model keeps guessing.
The paper’s observation is formal and sharp: under binary graders that give 1 for correct and 0 for anything else—including IDK—the expected score-maximizing strategy is never to abstain. That’s obvious; it’s also devastating. If almost every leaderboard and “flagship” capability suite uses that scoring, then every optimization pipeline that cares about those leaderboards—directly or indirectly—will favor hallucinatory behavior when uncertain.
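The incentive is easy to make explicit. A minimal sketch of expected scores under 0–1 grading, with invented probabilities:

```python
# Under binary grading (1 if correct, 0 for anything else, including IDK),
# answering weakly dominates abstaining: any nonzero chance of being right
# beats a guaranteed zero. Probabilities below are illustrative.

def expected_binary_score(p_correct, abstain):
    return 0.0 if abstain else p_correct

for p in (0.05, 0.25, 0.60):
    guess = expected_binary_score(p, abstain=False)
    idk = expected_binary_score(p, abstain=True)
    print(f"p_correct={p:.2f}: guess={guess:.2f} vs. IDK={idk:.2f}")
```

Even at 5% confidence, guessing scores higher in expectation than abstaining, so a score-maximizing model never says IDK.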
This is a socio-technical trap. Even if you invent the perfect hallucination eval, it will get drowned by the dozens of existing benchmarks that define model status, hiring, marketing, and progress. If those benchmarks punish uncertainty, aligned models that honestly hold their tongue when unsure will rank lower than their bluffy twins.
The proposed fix: make confidence explicit in mainstream evaluations
The authors don’t ask for a new hallucination leaderboard. They ask to tweak the mainstream ones to stop punishing abstention. The proposal is simple, pragmatic, and (importantly) objective:
- Add explicit confidence targets to the instructions of existing evaluations.
- Specify a penalty for incorrect answers that depends on a threshold t (e.g., t = 0.5, 0.75, 0.9), with IDK scoring 0.
Concretely, a prompt might end with:
“Answer only if you are > t confident, since mistakes are penalized t/(1−t) points, while correct answers receive 1 point, and an answer of ‘I don’t know’ receives 0 points.”
This creates a transparent scoring rubric where abstentions are rational when confidence is low. It also captures a “universal” optimal behavior across thresholds: respond when your internal probability of being correct exceeds t; otherwise abstain. The authors call this behavioral calibration. It avoids relying on explicit numeric confidence outputs (which can be awkward or gamed) and measures the thing we care about: Did the model speak only when it should?
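Here is a minimal sketch of that rubric and the decision rule it induces, using the threshold values the article mentions; the scoring function is a straightforward reading of the quoted instruction, not the paper’s reference implementation.

```python
# Confidence-target scoring: +1 if correct, -t/(1-t) if wrong, 0 for IDK.
# A model maximizing expected score should answer only when its probability
# of being correct exceeds t.

def threshold_score(correct: bool, abstained: bool, t: float) -> float:
    if abstained:
        return 0.0
    return 1.0 if correct else -t / (1.0 - t)

def expected_score_if_answering(p_correct: float, t: float) -> float:
    return p_correct - (1.0 - p_correct) * t / (1.0 - t)

p = 0.8  # the model's internal chance of being right (illustrative)
for t in (0.5, 0.75, 0.9):
    ev = expected_score_if_answering(p, t)
    decision = "answer" if ev > 0 else "abstain"
    print(f"t={t}: expected score at 80% confidence = {ev:+.2f} -> {decision}")
```

At t = 0.9 the wrong-answer penalty is 9 points, so even an 80%-confident model is better off abstaining; at t = 0.5 it should answer. That switch point is exactly the behavioral calibration the authors describe.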
Note the subtle but crucial difference from prior work: it’s not enough to build bespoke “risk-aware” or “uncertainty” tasks. Those remain sidelined if the big boards—the ones everyone watches—still grade in 0–1. Incorporating confidence targets into the widely used benchmarks (like SWE-bench, GPQA, and MMLU-Pro) could realign the field’s incentives. Rather than chasing a perfect hallucination eval, fix the exams the models already study for.
For completeness and context, several related lines of research complement this direction: semantic entropy for detecting hallucinations (Farquhar et al., 2024); methods to express uncertainty in words (Lin et al., 2022a; Mielke et al., 2022); RAG to ground claims where possible (Lewis et al., 2020; Shuster et al., 2021; Nakano et al., 2021); and evaluations like WildBench that increase realism (Lin et al., 2025). But the lever that changes behavior at scale, per the authors, is straightforward: stop penalizing abstention in the exams that matter.
Practical implications: what changes if we adopt confidence targets?
- Benchmarks would report performance curves or aggregates by threshold t (e.g., t ∈ {0.5, 0.75, 0.9}).
- Models that frequently say “I don’t know” when they should would no longer be punished. In fact, they’d be rewarded relative to bluffers whose accuracy dips below the stated threshold.
- Developers could train or select models that are behaviorally calibrated—answering when their chance of correctness beats the penalty. This aligns with evidence that models often encode useful uncertainty signals internally (Kadavath et al., 2022).
- Overconfident models would lose ground. Under t = 0.9, for instance, the penalty for a wrong answer is 9. It becomes rational to abstain often in hard regimes.
- Importantly, this does not freeze progress. It reframes “win conditions”: improving calibration and retrieval, boosting knowledge coverage, reasoning better—all drive higher expected scores under explicit confidence targets.
Would this eliminate hallucinations entirely? No—base-model error sources remain, and post-training still faces trade-offs. But the strategic incentive to bluff would be dampened, and evaluation scores would reflect trustworthiness rather than raw guesswork.
Limitations and scope, per the authors
The framework abstracts from many complexities:
- Plausibility vs. nonsense: The analysis focuses on plausible strings; nonsense generations are rare for SOTA models and can be folded into the error set without changing the thrust of the main theorem.
- Open-ended generation: In long-form outputs (e.g., biographies), the notion of “one falsehood = error” is a simplification; a graded notion of severity is reasonable but orthogonal to the core bound.
- RAG and reasoning help but don’t nullify incentives: If grading stays binary, guessing still dominates when retrieval fails or uncertainty is high.
- Latent context and ambiguity: Some “errors” are mismatches in intent; formalizing hidden context would require extensions that engage with aleatoric uncertainty.
- IDK is not the only uncertainty move: Hedging, skipping dubious details, or asking clarifying questions matter pragmatically and can be integrated into graded rubrics (see linguistic calibration: Mielke et al., 2022; Damani et al., 2025).
The authors keep the target narrow: a statistical explanation for why base models err, and a socio-technical insight into why post-training doesn’t kill hallucinations in practice.
A few concrete examples the paper highlights
- Arbitrary biography-type facts: Ask for Adam Kalai’s birthday in “DD-MM” format with “only answer if you know.” Multiple SOTA models replied with different concrete dates—all wrong. This is quintessential arbitrary-fact hallucination: the model doesn’t “know,” and the space of plausible wrong answers is much larger than the singleton right one.
- Counting Ds in DEEPSEEK: Some models answered “2” or “3,” or other incorrect numbers, reflecting tokenization and representational challenges for character-level tasks. Notably, reasoning-optimized variants can fix this by actually computing the count rather than sampling a token sequence that “sounds right.”
These aren’t cherry-picked. They illustrate the broader categories of errors the theory anticipates: (1) epistemic gaps for arbitrary facts; (2) representation/misfit failures; and (3) cases where even perfect calibration can’t overcome cryptographic/complexity barriers.
Where this leaves the field
The paper provides a unifying lens. Generative errors, including hallucinations, inherit the well-understood statistical limits of classification. Calibration makes those errors surface honestly. And the incentives we embed in our evaluations—accuracy over alignment with uncertainty—shape post-training behaviors.
The recommended mitigation is crisp: integrate explicit confidence targets into the widely used benchmarks—GPQA, MMLU-Pro, SWE-bench, HLE, and others—so that models aren’t penalized for abstaining under uncertainty. This change doesn’t require new datasets, new graders, or bespoke scoring wizardry. It requires editing instructions and scoring rules—then sticking to them.
A few references the paper uses (for context and further reading):
- GPT-4 Technical Report with calibration plots (OpenAI, 2023a)
- Constitutional AI post-training (Bai et al., 2022)
- IFEval instruction-following (Zhou et al., 2023)
- GPQA benchmark (Rein et al., 2024)
- MMLU-Pro (Wang et al., 2024)
- WildBench (Lin et al., 2025)
- SWE-bench (Jimenez et al., 2024)
- RAG overview (Lewis et al., 2020)
- Semantic entropy for hallucination detection (Farquhar et al., 2024)
- Calibrated LMs must hallucinate (Kalai & Vempala, 2024)
- DeepSeek-R1 reasoning model (DeepSeek-AI et al., 2025)
- OpenAI’s system cards and capability posts (OpenAI, 2025a, 2025c, 2025d, 2024, 2023b)
The key equations, in plain view
The core reduction ties generative error to misclassification on validity:
- With prompts, for any base model p̂ and clean training distribution p (all training examples valid), the paper shows:
$$\text{err} \;\ge\; 2 \cdot \text{err}_{\mathrm{IIV}} \;-\; \frac{\max_c |\mathcal{V}_c|}{\min_c |\mathcal{E}_c|} \;-\; \delta.$$
- In the “arbitrary facts” regime with IDK allowed and singleton rate sr (the fraction of prompts appearing exactly once with a non-IDK answer), with high probability over training data:
$$\text{err} \;\ge\; sr \;-\; \frac{2}{\min_c |\mathcal{E}_c|} \;-\; O\!\left(\frac{\ln N}{\sqrt{N}}\right) \;-\; \delta.$$
- For multiple-choice settings (one correct answer per prompt), if the best thresholded classifier in a model family has misclassification opt(𝒢), then:
$$\text{err} \;\ge\; 2\left(1-\frac{1}{C}\right) \cdot \text{opt}(\mathcal{G}),$$
where C is the number of options.
These are lower bounds; they don’t say when models will do worse, only that, statistically, they cannot do better under the given assumptions.
Final take
The paper reframes hallucinations as the predictable shadow of two forces:
- Statistical reality: Base models trained by cross-entropy to fit language distributions, and calibrated by that training, must spill probability into error regions when validity is not cleanly separable (arbitrary facts; representational mismatch; computational hardness).
- Social reality: Post-training optimizes for scores on benchmarks that mostly punish abstention. So models guess. Confidently.
The authors’ remedy is not a new leaderboard—it’s a small but potent change to how we grade the leaderboards we have. If we explicitly reward calibrated abstention—via clear confidence targets and penalties—models will learn to shut up when they should. Their accuracy will be judged not in a vacuum, but under a rational risk policy that aligns with real-world deployment.
That’s not just a statistical fix. It’s a norm-setting move. And it might be the simplest way to bend the arc of model behavior toward trust.