GDPval is an OpenAI evaluation of real professional “knowledge-work” on computers. It contains 1,320 expert-authored tasks spanning 44 occupations across 9 GDP-heavy sectors, with a public 220-task gold subset for head-to-head comparisons between human professionals and frontier models. Each task looks like actual work: a prompt + reference files (spreadsheets, PDFs, slide decks, audio/video, CAD, etc.) and a required deliverable. Experts graded outputs via blinded pairwise comparisons; OpenAI also built an experimental automated grader that roughly matches human agreement levels.
Headline results: models are approaching parity with industry experts on many tasks. On the gold subset, Claude Opus 4.1 topped overall (especially on aesthetics and document layout), while GPT-5 led on accuracy and instruction-following. Roughly half of model submissions were graded at least as good as the human expert baselines; e.g., 47.6% of Claude Opus 4.1 deliverables were wins or ties against humans. Performance of OpenAI frontier models improved roughly linearly over time on GDPval.
Speed/cost calculus: A typical gold-subset task takes a professional ~404 minutes (~6.7 hours) and costs ~$361. Grading a model’s output takes ~109 minutes and ~$86 of expert time. Using empirical API timings, OpenAI modeled “try once then fix” and “try n times then fix” workflows; under realistic review-and-redo assumptions, GPT-5 came out ~1.39× faster and ~1.63× cheaper than unaided experts on average. Naïve ratios that ignore review wildly overstate savings (e.g., 90× faster), so the paper emphasizes review-aware metrics as more honest.
Why this matters: GDPval raises the bar for evaluating multi-modal, long-horizon, file-centric knowledge work—exactly where “AI agents” will be deployed. It shows where models shine (accuracy for GPT-5; aesthetics for Claude) and where they fail (instruction-following and formatting remain common culprits).
It also demonstrates that prompt-engineering and agent scaffolding materially improve outcomes (e.g., eliminating black-square PDF artifacts, reducing slide formatting errors, increasing inspection behavior, and +5-point win-rate gains). OpenAI is open-sourcing the gold subset’s prompts + reference files and hosting the grader at evals.openai.com. Read the announcement context at openai.com/index/gdpval/.
GDPval: Evaluating AI on Real, High-Value Knowledge Work
OpenAI’s GDPval is a large-scale evaluation designed to measure whether frontier models can perform real, economically meaningful knowledge work—the kind professionals do on a computer using documents, spreadsheets, slides, images, audio, and code. It avoids synthetic puzzles and “toy” tasks in favor of long, multi-file, subjective, and “deliverable-centric” problems that resemble the day-to-day output of analysts, designers, marketers, researchers, finance pros, and engineers.
The project centers on 1,320 tasks across 44 occupations in 9 sectors that together account for a substantial share of U.S. value added. The public gold subset contains 220 fully graded tasks for model benchmarking and replication.
Where many benchmarks are text-in/text-out, GDPval is multi-modal and file-heavy: tasks may require up to 17 reference files in the gold subset (and up to 38 in the full set), and ~68% of tasks require interacting with at least one reference file. The design introduces subjectivity (style/aesthetics matter), long-horizon difficulty (average expert time ~7 hours), and a non-saturating metric (pairwise win-rate vs. a human baseline) that can keep rising as systems improve.
What GDPval Contains and How It Was Built
Occupations and Sectors: $3T of Annual Wages
GDPval covers tasks from 44 occupations clustered into 9 sectors that each exceed 5% of U.S. GDP (based on Q2 2024 value added by industry). Occupations were chosen to maximize total wages/compensation and to be predominantly digital. “Digital” was operationalized via a task-level classification on O*NET data: GPT-4o labeled each task as digital vs. non-digital; occupation-level “digital share” was computed via weights derived from O*NET task relevance, importance, and frequency, and occupations were flagged digital if ≥60% of task-weight was digital. This digital share was validated against the Acemoglu & Autor (2011) framework—rising with non-routine cognitive content and falling with routine/manual content.
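As a rough sketch of that weighting step (the paper’s exact weight formula isn’t reproduced here, so the relevance × importance × frequency product below is an illustrative assumption, not the documented method):

```python
from dataclasses import dataclass

# Hypothetical task record. The weight construction (relevance * importance * frequency)
# is an assumption for illustration; the paper derives weights from O*NET fields.
@dataclass
class ONetTask:
    occupation: str
    relevance: float    # O*NET task relevance, normalized to 0-1 here
    importance: float   # O*NET task importance, normalized to 0-1 here
    frequency: float    # O*NET task frequency, normalized to 0-1 here
    is_digital: bool    # task-level digital label (GPT-4o in the paper)

def digital_share(tasks: list[ONetTask]) -> float:
    """Weighted fraction of an occupation's task mass labeled digital."""
    weights = [t.relevance * t.importance * t.frequency for t in tasks]
    digital_weight = sum(w for w, t in zip(weights, tasks) if t.is_digital)
    return digital_weight / sum(weights)

def is_digital_occupation(tasks: list[ONetTask], threshold: float = 0.6) -> bool:
    # Occupations are flagged digital if at least 60% of task weight is digital.
    return digital_share(tasks) >= threshold
```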
To map occupations to sectors, the team used the BLS National Employment Matrix, assigning each occupation to the sector with highest employment; compensation baselines draw on BLS OEWS (May 2024). In total, the included occupations account for roughly $3T in annual wages.
Each task comprises a request and a required deliverable. Experts mapped requests to O*NET task statements, rated difficulty/representativeness/time/quality, and monetized the work by median hourly wage × estimated hours. All 1,320 full-set tasks passed an iterative review pipeline with automated screening and multiple expert reviews—~5 human reviews on average (≥3 minimum). The gold subset was then graded via blinded pairwise expert comparisons, often taking >1 hour per comparison, to compute headline win-rates.
Task characteristics and file realism. GDPval systematically uses real file types (PPTX/DOCX/XLSX/PDF/PNG/JPEG/MP4/CAD/etc.). On the gold subset, typical tasks include ~1.9 reference files and ~1.5 deliverables; 67.7% require at least one reference file. Gold tasks average ~9.5 hours and ~$398 in value; full-set tasks average ~8.6 hours and ~$391 with fatter tails (max 605 hours and $32k, respectively).
Who Built and Graded the Tasks?
OpenAI recruited industry professionals with at least 4 years of experience (average: 14 years), screened via interviews, background checks, training, and a quiz; many worked at brand-name employers (finance, tech, media, government, healthcare, etc.). These experts created tasks and graded completions in their own occupational niches.
How GDPval Scores Models
Human Pairwise Grading (the “Gold Standard”)
For each gold-subset task, experts received the prompt + references and were asked to rank unlabeled deliverables produced by (a) a professional human and (b/c/…) frontier models, blinded as much as feasible to origin (stylistic tells remained, e.g., em-dash habits or first-person phrasing). The primary metric is win-rate vs. the human baseline.
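As a concrete illustration of the win-rate bookkeeping (the record schema below is made up for the example, not the paper’s data format), a minimal sketch that tallies both strict wins and the wins-or-ties figure often quoted alongside it:

```python
from collections import Counter

# Each record: an expert's blinded verdict for one (task, model) pair vs. the human deliverable.
# "win" = model preferred, "tie" = judged equivalent, "loss" = human preferred.
comparisons = [
    {"task_id": "t1", "model": "model_a", "verdict": "win"},
    {"task_id": "t2", "model": "model_a", "verdict": "tie"},
    {"task_id": "t3", "model": "model_a", "verdict": "loss"},
]

def win_rates(records):
    counts = Counter(r["verdict"] for r in records)
    n = sum(counts.values())
    return {
        "win_rate": counts["win"] / n,                           # strict wins vs. human
        "win_or_tie_rate": (counts["win"] + counts["tie"]) / n,  # the headline-style metric
    }

print(win_rates(comparisons))  # {'win_rate': 0.33..., 'win_or_tie_rate': 0.66...}
```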
The Automated Grader (Experimental)
OpenAI also trained an automated grader to perform pairwise judgments in the style of occupational experts. On the gold subset, automated-grader agreement with human experts averaged ~66%, within 5 percentage points of human inter-rater agreement (~71%): not a replacement for experts, but highly useful for rapid sweeps and ablations. The team also documents concrete limitations, marking 12/220 gold tasks ungradable for reasons such as required internet access, non-Python runtimes, or font-rendering discrepancies.
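Agreement itself is a simple tally: the fraction of comparisons on which two graders return the same verdict. A minimal sketch, with made-up verdict lists standing in for real grading data:

```python
def agreement_rate(verdicts_a, verdicts_b):
    """Fraction of tasks on which two graders reach the same pairwise verdict."""
    assert len(verdicts_a) == len(verdicts_b)
    matches = sum(a == b for a, b in zip(verdicts_a, verdicts_b))
    return matches / len(verdicts_a)

# Illustrative only: compare the automated grader against one human reference,
# and two humans against each other, on the same task list.
human_1 = ["win", "loss", "tie", "loss", "win"]
human_2 = ["win", "loss", "loss", "loss", "win"]
auto    = ["win", "tie",  "tie", "loss", "win"]

print(agreement_rate(auto, human_1))     # automated grader vs. human agreement
print(agreement_rate(human_1, human_2))  # human inter-rater agreement baseline
```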
On blinded human comparisons, models are beginning to approach parity with industry experts on a variety of deliverables. In the gold subset, Claude Opus 4.1 achieved the strongest overall results, especially on aesthetics (document formatting, slide layout), while GPT-5 excelled on accuracy and instruction-following. In aggregate, the best model matched or beat human experts on roughly half of tasks: 47.6% of Claude Opus 4.1’s deliverables were wins or ties against the human baseline. Across OpenAI’s internal model generations, performance on GDPval improved roughly linearly over time.
Strengths and Weaknesses by Failure Theme
A clustering analysis of expert justifications finds that Claude, Gemini, and Grok most often lost due to instruction-following failures, while GPT-5 more often lost on formatting (despite fewer instruction-following issues). Gemini and Grok also promised deliverables they didn’t fully provide, ignored reference data, or used the wrong format. All models occasionally hallucinated or miscalculated, but GPT-5 and Grok had fewer accuracy errors than others.
Reasoning Effort, Prompting, and Scaffolding Matter (A Lot)
OpenAI tested reasoning effort at low/medium/high settings for o3 and GPT-5 and found predictable performance gains with more reasoning. They then explored prompt-tuning and scaffolding that encourages rigorous self-checks and multi-modal inspection (e.g., rendering deliverables as images, avoiding non-standard Unicode, standardizing fonts, and running best-of-N with a GPT-5 judge). This eliminated black-square artifacts that previously affected >50% of PDFs, reduced egregious PowerPoint formatting errors from 86%→64%, increased inspection behavior from 15%→97%, and boosted human-preference win-rates by ~5 percentage points. The message is simple: process engineering pays off.
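One way to picture the best-of-N step is the sketch below; the generate and judge callables and the sequential knockout structure are assumptions for illustration rather than OpenAI’s actual scaffold (the write-up only states that best-of-N sampling with a GPT-5 judge was used):

```python
from typing import Callable

# Assumed interfaces for illustration: generate(task) returns a candidate deliverable,
# judge(task, a, b) returns whichever of the two candidates it prefers.
def best_of_n(task: str,
              generate: Callable[[str], str],
              judge: Callable[[str, str, str], str],
              n: int = 4) -> str:
    """Sample n candidate deliverables and keep the pairwise judge's overall winner."""
    candidates = [generate(task) for _ in range(n)]
    best = candidates[0]
    for challenger in candidates[1:]:
        best = judge(task, best, challenger)  # knockout comparison, one pair at a time
    return best
```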
OpenAI also created an under-contextualized variant of GDPval (prompts ~42% as long) to study ambiguity navigation: models performed worse when they had to figure out context on their own—highlighting that agent systems still struggle with goal discovery and data-seeking absent explicit instructions.
Speed and Cost: From Naïve Ratios to Review-Aware Math
A core GDPval contribution is honest speed/cost accounting that incorporates review and redo—because professionals must check AI outputs before shipping.
Human baselines (gold subset): average completion HT = 404 min; cost HC ≈ $361 (median wage × reported hours). Reviewing a model’s output averages RT = 109 min with RC ≈ $86.
Model timings/costs (MT, MC) come from empirical API logs and invoices; the key win-rate (w) is the fraction of tasks where the model beats the human deliverable under expert judgment.
OpenAI analyzes three scenarios:
Naïve ratio: Ignore quality differences and review time: HT/MT (speed) and HC/MC (cost). These numbers are enormous (e.g., GPT-5 looks ~90× faster and ~474× cheaper)—but misleading, because they assume zero review.
Try once, then fix: Do one model sample, review it, and if it’s below the quality bar, the expert does the task. The expected time and cost become E[T_1] = MT + RT + (1 − w)·HT and E[C_1] = MC + RC + (1 − w)·HC.
Try n times, then fix: Repeat “sample + review” up to n times, then hand the task to the expert if the output is still unsatisfactory, yielding closed-form expectations E[T_n] and E[C_n] (geometric series in 1 − w). In the limit n → ∞ (with w > 0), time savings approach HT / ((MT + RT)/w) and cost savings approach HC / ((MC + RC)/w); a numeric sketch follows below.
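To make the arithmetic concrete, here is a minimal Python sketch of the “try n times, then fix” expectation using the gold-subset human baselines quoted above; the model time/cost and win-rate values are placeholders for illustration, not the paper’s measured API numbers:

```python
# Review-aware expected time/cost, following the formulas above.
# HT, HC, RT, RC are the gold-subset baselines quoted in this section;
# MT, MC, and w below are illustrative placeholders, not measured values.

HT, HC = 404.0, 361.0   # human completion: minutes, dollars
RT, RC = 109.0, 86.0    # expert review of a model output: minutes, dollars
MT, MC = 5.0, 2.0       # one model sample: minutes, dollars (assumed)
w = 0.40                # win rate vs. the human deliverable (assumed)

def expected_try_n_then_fix(n: int):
    """E[T_n], E[C_n]: up to n sample+review rounds, then the expert redoes the task."""
    # Expected number of rounds actually run: 1 + (1-w) + ... + (1-w)^(n-1)
    rounds = sum((1 - w) ** k for k in range(n))
    fail_all = (1 - w) ** n          # probability every round falls below the quality bar
    t = rounds * (MT + RT) + fail_all * HT
    c = rounds * (MC + RC) + fail_all * HC
    return t, c

t1, c1 = expected_try_n_then_fix(1)                 # "try once, then fix"
print(HT / t1, HC / c1)                             # review-aware speedup and cost saving
print(HT / ((MT + RT) / w), HC / ((MC + RC) / w))   # n -> infinity limits from the text
```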
Empirical takeaway (gold subset): Under review-aware assumptions, GPT-5 is ~1.39× faster and ~1.63× cheaper than unaided experts on average in the “try n times then fix” setup; o3 also shows solid gains (~1.28× / ~1.47×). If you ignore review, you’ll believe fantastical numbers like 90× (speed) and 474× (cost). The paper warns these naïve ratios overstate benefits.
OpenAI caveats that review time for human deliverables isn’t counted (teams often review each other’s work), and that catastrophic failures, though rare (~3% of GPT-5 losses), carry outsized costs in some domains. Roughly 29% of GPT-5 losses were rated bad or catastrophic; the largest slice of losses was “acceptable but subpar,” implying many misses are fixable with iteration.
What the Numbers Mean for AI Agents and Enterprises
Deliverables win or lose on “boring” details. Instruction-following and formatting remain top failure modes. Even very capable models can misapply instructions, overlook reference data, or output malformed files. Small process changes (rendering previews, standardizing fonts, linting layouts, best-of-N sampling, minimum-effort thresholds) translate directly into quality.
Human-in-the-loop stays essential. GDPval’s review-aware economics show real—yet modest—productivity gains at today’s reliability, especially for GPT-5 and o3. The more critical or ambiguous the task, the more oversight and iteration matter.
Reasoning knobs and scaffolding are levers. Increasing reasoning effort and adding agent scaffolding systematically boost outcomes. For teams deploying agents, platform features (file renderers, validators, structured checklists) and prompt policies (style guides, unit checks) are not garnish—they’re throughput.
Ambiguity is still hard. The “under-contextualized” experiment demonstrates that models struggle with discovering constraints, finding the right data, and choosing deliverable formats when context is thin. Agent research should invest in context discovery and interactive task clarification.
How GDPval Differs from Prior Evals
Multi-modal, file-centric (CAD, audio, video, slides, spreadsheets—not just text). ~68% of tasks require reference files.
Long-horizon: tasks average ~7–10 hours of expert time; some span weeks.
Subjective + aesthetic criteria count (format, layout, document polish), making win-rate a human-preference metric rather than fixed-label accuracy.
Non-saturating metric: pairwise win-rate vs. a moving baseline (today human, tomorrow stronger models).
Economic grounding: occupations chosen by GDP share and wage mass; task value priced by median wage × time.
Open Sourcing and Reproducibility
OpenAI is open-sourcing the prompts and reference files for the 220-task gold set (expert deliverables and personally identifying details are omitted/scrubbed). The experimental automated grader is available at evals.openai.com. Usage notes and limitations are documented (e.g., 12/220 tasks ungradable due to grader constraints like lack of internet or non-Python runtimes; font rendering mismatch can also bite). For broader context and ongoing updates, see the announcement at openai.com/index/gdpval/.
Limitations and Future Work
Scope/size: The current full set covers 44 occupations with ~30 tasks each, a strong start rather than a census of all knowledge work. Expansion is underway.
Computer-based knowledge work only: Manual labor, physical tasks, and work requiring tacit knowledge, PII, proprietary tools, or interpersonal interaction are out of scope (for now).
One-shot prompts: Real work is interactive; GDPval prompts are precisely specified. The under-contextualized variant shows how performance degrades when context is missing—pointing to future, more interactive setups.
Grader artifacts: Even with 66% automated grader agreement (vs. ~71% human inter-rater), some tasks remain tricky to auto-grade (e.g., internet dependencies, non-Python toolchains, fonts, speech-to-text). Human adjudication remains the reference.
Practical Guidance for Teams Adopting AI on Knowledge Work
Adopt review-aware metrics. When you pitch productivity, include review + redo time; use “try once then fix” or “try n times then fix” models to plan your workflows and SLAs.
Engineer the prompt and the agent runtime. Bake in mandatory formatting checks, file rendering previews, font/Unicode standards, best-of-N, and judging passes. These lifted GPT-5 win-rates by ~5 points and eliminated recurring artifacts.
Turn ambiguity into a first-class step. Where specs are thin, add explicit context-gathering and clarification routines. The under-contextualized study shows large value hiding here.
Play to model strengths. If aesthetics and layout matter most, Claude may excel; for instruction-following/accuracy, GPT-5 may be stronger—plan your routing and review tiers accordingly.
Track failure severity. Many losses are “acceptable but subpar” (upgradeable with iteration), but a small fraction are catastrophic—allocate human safety nets where stakes are high.
Why GDPval Is a Big Deal for AI Progress
GDPval is a stress test for agentic AI—the kind of work we expect AI copilots and automation platforms to handle inside companies: interpreting PDFs and emails, reconciling spreadsheets, building decks, formatting reports, and synthesizing multimedia. Three reasons it stands out:
It builds a bridge from benchmarks to business value by tying tasks to wage mass and task value, not just abstract accuracy.
It emphasizes deliverables, style, and layout, acknowledging that “professional quality” isn’t only correctness but also presentation.
It demonstrates that process improvements (reasoning effort, scaffolding, prompt policies) move the needle measurably—evidence that deployment engineering matters as much as pre-training.
In that sense, GDPval is both a measurement and a playbook: measure your agents on file-heavy deliverables with human pairwise grading; engineer your pipelines around review-aware economics; iterate on scaffolds that enforce format correctness, multi-modal validation, and best-of-N selection.
Closing Thoughts
GDPval shows a realistic path from “smart chatbots” to productive digital colleagues: build tasks that match real deliverables, grade against humans, insist on review-aware economics, and engineer the runtime. On this foundation, the paper documents genuine, repeatable gains—model parity on many tasks, linear progress in OpenAI’s internal models, and meaningful speed/cost improvements once you account for review. It also identifies the gaps that remain: instruction compliance, format fidelity, and ambiguity navigation.
For researchers, GDPval is a dataset, a methodology, and a north star. For companies, it’s a procurement checklist: if your vendor can’t show GDPval-style deliverables, human-graded win-rates, and review-aware ROI, they’re selling smoke. For the community, the open gold subset at evals.openai.com plus the announcement context at openai.com/index/gdpval/ make it straightforward to replicate, criticize, and extend this work—exactly what we need to turn hype into measured progress.
Sources & Notes
OpenAI, GDPval paper (PDF). Key details: task counts, sectors/occupations, gold-subset grading setup, automated grader agreement, model comparisons, and speed/cost methodology and results.