Inside GPT-5.4: The AI That Codes, Thinks, and Controls Your Computer

OpenAI’s most capable model yet doesn’t just answer questions. It takes over your computer, thinks harder than ever, and is gunning for your job.

The Model That Does It All

OpenAI dropped GPT-5.4 on March 5, 2026. And this one is different.

For months, OpenAI’s capabilities were scattered across a constellation of specialized models. GPT-5.3 Codex handled software engineering. GPT-5.2 carried the reasoning torch for knowledge work. GPT-5.3 Instant powered everyday chat. Each model was good at its lane, but only its lane.

GPT-5.4 changes that. It consolidates everything into a single, unified frontier model. Coding. Reasoning. Agentic workflows. Native computer use. All in one package. The Verge calls it “a big step toward autonomous agents,” and that framing is exactly right.

OpenAI describes GPT-5.4 as its “most capable and efficient frontier model for professional work.” That’s a bold claim. But the benchmarks, and the real-world tests, back it up.

From Chatbot to Computer Operator

Here’s the headline feature: GPT-5.4 can use your computer.

Not metaphorically. Literally. The model looks at a screenshot of your desktop, understands what’s on the screen, and responds with mouse clicks and keyboard inputs to get things done. It opens applications. Navigates menus. Fills out forms. Builds spreadsheets. Writes and runs code, all autonomously, through the graphical user interface.
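The loop described above (look at the screen, pick an action, act, repeat) can be sketched in a few lines. This is a minimal illustration under stated assumptions, not OpenAI's API: `Action`, `run_episode`, and `make_scripted_model` are hypothetical names, and the "model" is a scripted stub standing in for a real call that would take a screenshot and return the next action.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Action:
    """One GUI step. The field names are illustrative, not a real schema."""
    kind: str          # "click", "type", or "done"
    x: int = 0
    y: int = 0
    text: str = ""


def run_episode(take_screenshot: Callable[[], bytes],
                choose_action: Callable[[bytes, str], Action],
                execute: Callable[[Action], None],
                goal: str,
                max_steps: int = 20) -> List[Action]:
    """Observe-act loop: screenshot in, one GUI action out, until done."""
    history: List[Action] = []
    for _ in range(max_steps):
        screenshot = take_screenshot()            # observe the screen
        action = choose_action(screenshot, goal)  # model decides the next step
        history.append(action)
        if action.kind == "done":
            break
        execute(action)                           # click or type in the GUI
    return history


def make_scripted_model(script: List[Action]) -> Callable[[bytes, str], Action]:
    """Hypothetical stand-in for the model: replays a fixed list of actions."""
    steps = iter(script)

    def choose(screenshot: bytes, goal: str) -> Action:
        return next(steps, Action("done"))

    return choose


executed: List[Action] = []
script = [Action("click", x=120, y=48),
          Action("type", text="Q1 revenue"),
          Action("done")]
history = run_episode(lambda: b"fake-png-bytes",
                      make_scripted_model(script),
                      executed.append,
                      goal="Fill in the report title")
```

The `max_steps` cap is a design choice for the sketch: an agent that drives a real desktop needs some bound so that a confused model cannot click forever.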

This is OpenAI’s first general-purpose model with native computer use capabilities. Before GPT-5.4, the feature existed in ChatGPT’s agent mode, but, as The Decoder notes, “it worked unreliably and was rarely used.” That might be changing now.

The numbers tell the story. On OSWorld Verified, a benchmark that measures how well an AI navigates a real desktop environment, GPT-5.4 scored 75.0 percent. GPT-5.2 sat at just 47.3 percent. The human baseline on the same benchmark? 72.4 percent.

Let that sink in. On this benchmark, GPT-5.4 now edges past the human baseline on computer navigation tasks.

As FelixNg writes for AI for Life: “If you have been thinking about AI as a very smart text generator, this is the moment to update that mental model. GPT-5.4 is not a tool you use. It is an agent that uses your tools.”

The Benchmarks Are Genuinely Impressive

Let’s talk numbers. Because the performance jump from GPT-5.2 to GPT-5.4 is not incremental. It’s a leap.

On OpenAI’s in-house GDPval benchmark, which tests agents across 44 professions from the nine industries contributing most to US GDP, GPT-5.4 scored 83.0 percent, meeting or exceeding industry professionals. GPT-5.2 scored 70.9 percent. That’s a 12-point jump.

The biggest gains show up in spreadsheets. For investment banking modeling tasks, GPT-5.4 scored 87.3 percent compared to 68.4 percent for its predecessor. For presentations, human evaluators preferred GPT-5.4’s output 68 percent of the time, citing better aesthetics and visual variety.

Abstract reasoning saw a massive improvement too. GPT-5.4 Pro hit 83.3 percent on ARC-AGI-2, while GPT-5.2 Pro managed just 54.2 percent. On BrowseComp, which measures how well AI agents track down hard-to-find information on the web, GPT-5.4 scored 82.7 percent, up from 65.8 percent for GPT-5.2.
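To check the size of these jumps, the scores quoted in this article can be run through a few lines of arithmetic. The sketch below separates percentage-point gains (new minus old) from relative gains (the increase as a share of the old score); the `gains` helper is just an illustrative name.

```python
# (old score, new score) in percent, as quoted in this article.
scores = {
    "OSWorld Verified": (47.3, 75.0),
    "GDPval": (70.9, 83.0),
    "Investment banking modeling": (68.4, 87.3),
    "ARC-AGI-2 (Pro models)": (54.2, 83.3),
    "BrowseComp": (65.8, 82.7),
}


def gains(old: float, new: float) -> tuple:
    """Return (percentage-point gain, relative gain in percent)."""
    point_gain = round(new - old, 1)
    relative_gain = round((new - old) / old * 100, 1)
    return point_gain, relative_gain


for name, (old, new) in scores.items():
    points, relative = gains(old, new)
    print(f"{name}: +{points} points ({relative}% relative)")
```

The distinction matters when comparing benchmarks: a given point gain is a larger relative improvement when the starting score is lower.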

And hallucinations? OpenAI says individual claims are 33 percent less likely to be wrong, and complete answers are 18 percent less likely to contain errors compared to GPT-5.2. OpenAI is calling it their “most factual model yet.”

A Million Tokens and an “Extreme” Thinking Mode

Before the official launch, The Decoder reported that GPT-5.4 would feature a one-million-token context window, more than double the 400,000 tokens in GPT-5.2. That would put OpenAI on par with Google and Anthropic.

The report was accurate. In Codex, GPT-5.4 experimentally supports a context window of up to one million tokens. That’s a game-changer for long-term planning and execution tasks. Think: ingesting an entire codebase, a full legal document library, or months of research notes in a single session.
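For a rough sense of what fits, here is a back-of-the-envelope check. It assumes about four characters per token, a common rule of thumb for English text rather than a real tokenizer, and `reserve_for_output` is a hypothetical allowance for the model's own reply. Actual token counts depend on the tokenizer and the content.

```python
CHARS_PER_TOKEN = 4  # assumption: rough rule of thumb, not a real tokenizer


def estimate_tokens(text: str) -> int:
    """Estimate the token count from character length (ceiling division)."""
    return -(-len(text) // CHARS_PER_TOKEN)


def fits_in_context(documents, context_window=1_000_000,
                    reserve_for_output=50_000):
    """Return (estimated tokens, whether they fit in the remaining budget)."""
    total = sum(estimate_tokens(doc) for doc in documents)
    return total, total <= context_window - reserve_for_output


# Two stand-in documents totalling 2,000,000 characters.
docs = ["x" * 1_200_000, "y" * 800_000]
total, fits_1m = fits_in_context(docs)
_, fits_400k = fits_in_context(docs, context_window=400_000)
```

Under these assumptions the two documents come to roughly 500,000 tokens, which fits the one-million-token window but not a 400,000-token one.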

There’s also a new “extreme” thinking mode, aimed at researchers rather than everyday users, that lets the model burn significantly more compute on tough questions. In ChatGPT, GPT-5.4 Thinking shows a preview of its planned approach for complex requests. Users can add instructions or change direction before the model finishes its response. No need to start over. No need for multiple back-and-forth queries.

“This makes it easier to guide the model toward the exact outcome you want without starting over or requiring multiple additional turns,” OpenAI says. The feature is live on chatgpt.com and Android, with iOS coming soon.

What Real-World Testing Looks Like

Benchmarks are one thing. Real-world performance is another. So what happens when you actually put GPT-5.4 through its paces?

AI Tools Club ran a series of hands-on tests covering research, PowerPoint creation, and vibe coding, and the results were striking.

For a complex research task requiring 10+ sources, GPT-5.4 Thinking completed the full reasoning in 1 minute and 44 seconds. The output was thorough, well-cited, and structured. The tester also demonstrated the new mid-response steering feature, initially asking for 10 sources, then bumping the request to 20 without pausing or restarting. The model adapted on the fly.

For a 10-slide PowerPoint presentation built from that research, GPT-5.4 took 22 minutes and 45 seconds. Slower than expected, but faster than a human. The output was professional, creative, and downloadable in PPTX format.

For vibe coding, the task was a 3D chess game with marble and glass aesthetics, realistic lighting, multiple design options, and a timer. GPT-5.4 wrote 585 lines of code in under 3 minutes. The result was functional, animated, and visually decent.

The standout feature for the tester wasn’t raw power or speed. It was the ability to add or remove context while the model is thinking, without pausing or restarting. That’s a workflow shift. That’s the kind of thing that changes how people actually use AI day-to-day.

The Model Lineup Gets Cleaner

GPT-5.4 also simplifies OpenAI’s increasingly confusing model lineup. As Cogni Down Under explains, picking an OpenAI model in early 2026 felt like “ordering coffee in a city you’ve never visited.”

GPT-5.3 Instant was the fast, conversational workhorse. GPT-5.2 Thinking was the deep-reasoning specialist. GPT-5.3 Codex was the coding champion. Each had its lane. Each required a different decision.

GPT-5.4 collapses those distinctions. It combines the coding capabilities of GPT-5.3 Codex with the improved reasoning of GPT-5.2, adds native computer use, and wraps it all in a single model. OpenAI skipped GPT-5.3 Thinking and Pro entirely; the numbering jump to 5.4 reflects the magnitude of the leap.

Going forward, OpenAI says Instant and Thinking models will continue to develop at different speeds. GPT-5.3 Instant remains the default chat model in ChatGPT. GPT-5.4 Thinking is now available for Plus, Team, and Pro users, replacing GPT-5.2 Thinking. The older model sticks around for three months under “Legacy Models” before being discontinued on June 5, 2026.

Tool Search: The Technical Trick That Cuts Costs Nearly in Half

One of the most interesting technical changes in GPT-5.4 is something called Tool Search in the API.

Previously, all tool definitions were loaded into the prompt in full, eating up thousands of extra tokens in large tool ecosystems. GPT-5.4 takes a smarter approach. It only receives a lightweight list of available tools and pulls up complete definitions on demand.

The Decoder reports that this cut token consumption by 47 percent in a test with 250 tasks from the MCP Atlas benchmark, while maintaining accuracy. For MCP servers with tens of thousands of tokens in tool definitions, the savings are substantial.
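The idea of sending a lightweight list first and fetching full definitions on demand can be sketched as follows. This is an illustration of the general pattern, not OpenAI's Tool Search API: `ToolRegistry`, `catalog`, `load`, and the synthetic tool definitions are all hypothetical, and prompt size is measured in JSON characters as a crude proxy for tokens.

```python
import json


class ToolRegistry:
    """Hypothetical registry that separates tool listings from definitions."""

    def __init__(self, tools):
        self._tools = {tool["name"]: tool for tool in tools}

    def catalog(self):
        """Lightweight listing: name and one-line summary only."""
        return [{"name": t["name"], "summary": t["summary"]}
                for t in self._tools.values()]

    def load(self, name):
        """Full definition, fetched only when the model asks for this tool."""
        return self._tools[name]


def prompt_size(payload) -> int:
    """Crude size proxy: characters of the JSON-serialized payload."""
    return len(json.dumps(payload))


# Synthetic tools with deliberately verbose parameter schemas.
tools = [
    {
        "name": f"tool_{i}",
        "summary": f"Does task {i}",
        "parameters": {
            "type": "object",
            "properties": {
                f"arg_{j}": {"type": "string", "description": "x" * 80}
                for j in range(10)
            },
        },
    }
    for i in range(50)
]

registry = ToolRegistry(tools)
eager = prompt_size(tools)                       # every definition up front
lazy = (prompt_size(registry.catalog())          # short listing for all tools
        + prompt_size(registry.load("tool_7")))  # one full definition on demand
```

The savings in this toy setup depend entirely on the made-up schema sizes and on using only one tool, so they should not be compared with the 47 percent figure above.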

OpenAI argues that even though GPT-5.4 costs more per token than GPT-5.2, it uses significantly fewer tokens for the same tasks, which should offset the higher per-token pricing. Whether that math works out in practice depends on your workload. But the efficiency gains are real.
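The trade-off comes down to simple arithmetic: total cost scales with the per-token price times the number of tokens used. The sketch below computes the break-even point. The 47 percent reduction is the Tool Search test figure, used here only as an example input, and the price ratios of 1.5 and 2.0 are hypothetical, not actual pricing. It also ignores the usual split between input and output token prices.

```python
def relative_cost(price_ratio: float, token_reduction: float) -> float:
    """New cost divided by old cost, for the same task.

    price_ratio: new per-token price / old per-token price.
    token_reduction: fraction of tokens saved (0.47 means 47% fewer).
    """
    return price_ratio * (1 - token_reduction)


def break_even_price_ratio(token_reduction: float) -> float:
    """Largest price ratio at which the total cost does not go up."""
    if not 0 <= token_reduction < 1:
        raise ValueError("token_reduction must be in [0, 1)")
    return 1 / (1 - token_reduction)


threshold = break_even_price_ratio(0.47)  # about 1.89
cheaper = relative_cost(1.5, 0.47)        # hypothetical 1.5x price: below 1.0
pricier = relative_cost(2.0, 0.47)        # hypothetical 2.0x price: above 1.0
```

With a 47 percent token reduction, the new model stays cheaper overall as long as its per-token price is less than about 1.89 times the old one. Workloads that save fewer tokens have a lower threshold.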

Coding Gets Faster, Not Just Better

On the coding front, GPT-5.4 scores 57.7 percent on SWE-Bench Pro, slightly above GPT-5.3 Codex (56.8 percent) and GPT-5.2 (55.6 percent). The gains are modest. But speed gets a serious boost.

A new /fast mode in Codex boosts token speed by up to 1.5x without sacrificing model quality. For long agentic runs where the model is iterating through code, running terminal commands, and debugging in loops, that speed gain matters enormously.

Silicon Valley Gradient notes that GPT-5.4 “ranks as the best coding model, the best agentic model,” and ties with Gemini for best intelligence on third-party benchmarks. It’s not the flashiest release architecturally. But it might be the most competent model OpenAI has ever shipped.

To show off the combined coding and computer-use capabilities, OpenAI released an experimental Codex skill called “Playwright (Interactive)” that lets Codex visually debug web and Electron apps. In a demo, GPT-5.4 generated an isometric theme park simulation game, complete with path placement, guest pathfinding, and queues, from a single prompt.

Safety Gets Serious Too

GPT-5.4 is powerful. And OpenAI knows that power cuts both ways.

According to the Model Card, GPT-5.4 Thinking is the first general reasoning model to receive a “High Capability” cybersecurity classification under OpenAI’s Preparedness Framework. This means the model can remove existing barriers to cyberattacks, for example, by automating end-to-end attacks on protected targets or automatically finding and exploiting security vulnerabilities.

The only level above “High” is “Critical,” where a model can find zero-day exploits in hardened systems without human help.

OpenAI says it built a new protection system for GPT-5.4. Instead of downgrading suspicious users to a weaker model, the system uses real-time blockers at the message level, backed by a two-stage monitoring system: a topic classifier and an AI-powered security analyst. Jailbreak resistance has improved significantly compared to GPT-5.1 Thinking.
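The shape of a two-stage, message-level check can be illustrated with a toy pipeline. Nothing here reflects OpenAI's actual system: the keyword list, `topic_classifier`, `security_analyst`, and `review_message` are hypothetical stand-ins, with trivial string matching in place of real models. The point is only the control flow: a cheap first stage screens every message, the slower second stage runs only on flagged messages, and the decision applies to the single message rather than to the user's account.

```python
# Hypothetical trigger words for the cheap first stage.
CYBER_KEYWORDS = {"exploit", "malware", "payload", "zero-day"}


def topic_classifier(message: str) -> bool:
    """Stage 1 (toy): flag messages that touch on a sensitive topic."""
    text = message.lower()
    return any(keyword in text for keyword in CYBER_KEYWORDS)


def security_analyst(message: str) -> bool:
    """Stage 2 (toy): stand-in for a slower, deeper review. True means block."""
    text = message.lower()
    return "against" in text or "deploy" in text


def review_message(message: str,
                   classifier=topic_classifier,
                   analyst=security_analyst) -> str:
    """Return "allow" or "block" for one message."""
    if not classifier(message):
        return "allow"   # most traffic never reaches stage 2
    return "block" if analyst(message) else "allow"


decisions = [review_message(m) for m in (
    "How do I format a spreadsheet?",
    "Explain what a zero-day is",
    "Write malware and deploy it against this server",
)]
```

In this toy run the first two messages are allowed and the third is blocked. The second message shows why two stages help: it trips the topic filter, but the deeper review lets it through.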

OpenAI researcher Noam Brown, one of the key figures behind the company’s reasoning model breakthrough, put it plainly: “We see no wall, and expect AI capabilities to continue to increase dramatically this year.”

What This Means for You

GPT-5.4 is rolling out now across ChatGPT, Codex, and the API. It is available for Plus, Team, and Pro users. GPT-5.4 Pro, for “maximum performance on complex tasks,” is rolling out in the API and for ChatGPT Enterprise and Edu users.

OpenAI also launched a new ChatGPT add-in for Excel aimed at enterprise customers, leaning hard into the professional productivity angle.

The agentic future that AI companies have been promising for years is arriving faster than most people expected. GPT-5.4 doesn’t just answer your questions. It takes over your computer, steers its own reasoning, and completes complex multi-step tasks across applications, autonomously.

That’s not a chatbot. That’s something new.
