GPT-5.6 Sol is OpenAI's flagship model in the GPT-5.6 family, positioned for hard reasoning, coding, agentic, cyber, and advanced research workflows.

OpenAI GPT-5.6 Sol: Benchmarks and Specs

Last updated: June 26, 2026. Kingy AI note: This analysis separates OpenAI’s published claims from what remains unproven in public. Exact benchmark scores are only used where a cited source publishes them.

Image (AI-generated): GPT-5.6 Sol as a glowing, solar-themed frontier model core in a dark AI lab. No official OpenAI logo is used.

OpenAI GPT-5.6 Sol is not just another model name in the frontier-model naming fog. It is a signal flare.

According to OpenAI’s GPT-5.6 announcement and GPT-5.6 Preview System Card, GPT-5.6 is a three-model family: Sol, Terra, and Luna. Sol is the flagship. Terra is the lower-cost balanced model. Luna is the fast, affordable model. That part sounds like ordinary product segmentation. The interesting part is everything wrapped around it: new reasoning modes, explicit cache controls, cyber and biology evals, limited preview access, trusted-partner gating, government coordination, and a system card that treats this model family as High capability for Cybersecurity and Biological/Chemical risk.

That is the story. Not “new chatbot is smarter.” The story is that frontier AI is now powerful enough that release strategy, access control, government coordination, and safety infrastructure are becoming part of the product itself.

What OpenAI Announced

OpenAI says GPT-5.6 is launching first as a limited preview through the API and Codex for select trusted partners, with broader availability for ChatGPT, Codex, and API coming soon. The system card says OpenAI previewed its plans and the models’ capabilities with the U.S. government before launch and, at the government’s request, began with a small trusted-partner preview before broader release.

That government-preview detail deserves attention. It is arguably as important as Sol’s benchmark story. OpenAI is not only releasing a stronger model. It is demonstrating a more controlled release pattern for a model family that the company classifies as High capability in two sensitive Preparedness Framework domains.

The basic product ladder is straightforward:

Model	Positioning	What it appears built for
GPT-5.6 Sol	Flagship model	Hard coding, agentic work, advanced reasoning, cyber/bio-heavy research workflows, frontier evaluations
GPT-5.6 Terra	Balanced lower-cost model	Strong general capability with lower cost; OpenAI says it is competitive with GPT-5.5 while being 2x cheaper
GPT-5.6 Luna	Fastest and most cost-efficient model	High-throughput applications, cheaper agents, routine workloads, latency-sensitive products

OpenAI also says GPT-5.6 introduces a new max reasoning effort for Sol and an ultra mode that uses subagents for complex work. In plain English: OpenAI is pushing beyond a single model response toward a more agentic execution pattern, where the system can spend more reasoning budget and delegate pieces of a hard job internally.

The GPT-5.6 Model Family: Sol, Terra, and Luna

The family structure matters because it shows how frontier labs are packaging capability now. One model is not enough. Labs need a flagship for the hard work, a middle model for commercial adoption, and a cheaper model for volume.

Sol is the model for users who care about the hardest tasks and can tolerate higher cost. Builders should think of it as the model for long-horizon coding, hard debugging, advanced tool use, cyber-defense workflows, and high-stakes analysis where a weaker model’s mistakes are expensive.

Terra is the classic adoption play. OpenAI says Terra is competitive with GPT-5.5 while being 2x cheaper. If that holds in real customer evals, Terra may become the default model for teams that want near-frontier quality without always paying Sol prices.

Luna is the throughput play. OpenAI positions it as the fastest and most cost-efficient GPT-5.6 model, offering strong capability at the lowest price in the family. That matters for agent swarms, customer support, extraction, classification, summarization, routing, draft generation, and background automation.

The strategic move is obvious: OpenAI is giving developers a ladder. Use Luna when volume matters. Use Terra when quality/cost balance matters. Use Sol when the work is hard enough that the model is the bottleneck.

GPT-5.6 Sol Specifications and Access Details

OpenAI’s announcement says GPT-5.6 models are available during preview through API and Codex to select trusted partners, with broader availability for ChatGPT, Codex, and API coming soon. The system card adds the limited-preview reason: OpenAI says it previewed plans and capabilities with the U.S. government and began with a restricted group at the government’s request.

For developers, the important access details are:

Area	GPT-5.6 detail
Release stage	Limited preview first; broader availability planned
Initial surfaces	API and Codex for select trusted partners
Future surfaces	Broader availability in ChatGPT, Codex, and the API
Flagship model	GPT-5.6 Sol
Lower-cost model	GPT-5.6 Terra
Fastest model	GPT-5.6 Luna
Reasoning	Sol adds max reasoning effort; GPT-5.6 adds ultra mode with subagents
Caching	Explicit cache breakpoints and a 30-minute minimum cache lifetime
Future inference	OpenAI says Sol is planned for Cerebras at up to 750 tokens per second in July

The caching update is not glamorous, but it is commercially important. OpenAI says GPT-5.6 adds explicit cache breakpoints, a 30-minute minimum cache lifetime, cache writes billed at 1.25x the uncached input rate, and cache reads receiving a 90% cached-input discount. For agentic systems that reuse long instructions, tool specs, codebase summaries, or retrieval context, cache economics can matter as much as raw token price.
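To make those cache numbers concrete, here is a minimal Python sketch of the arithmetic. It assumes the shared prefix is written to the cache once at 1.25x the uncached input rate and then read on every later call at 10% of that rate (the 90% discount), all within the cache lifetime. The function name and the workload figures are hypothetical, and OpenAI’s actual billing may differ in details the announcement does not spell out.

```python
def prefix_cost_usd(prefix_tokens, calls, input_price_per_m, cached=True):
    """Input cost of one shared prompt prefix reused across `calls` requests.

    Assumption: the first call pays a cache write (1.25x the uncached rate)
    and every later call pays a cache read (10% of the uncached rate).
    """
    if calls <= 0:
        return 0.0
    base = prefix_tokens / 1_000_000 * input_price_per_m  # one uncached pass
    if not cached:
        return base * calls
    write = base * 1.25                  # first call writes the cache
    reads = base * 0.10 * (calls - 1)    # later calls read it
    return write + reads


# Hypothetical workload: a 20,000-token prefix reused 50 times on Sol ($5 per 1M input)
uncached = prefix_cost_usd(20_000, 50, 5.00, cached=False)
cached = prefix_cost_usd(20_000, 50, 5.00, cached=True)
print(f"uncached: ${uncached:.3f}  cached: ${cached:.3f}")
```

Under these assumptions the cached prefix costs about $0.62 instead of $5.00. A single call is slightly more expensive with caching because of the write premium, so the saving only starts at the second reuse.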

Pricing Breakdown: Sol vs Terra vs Luna

OpenAI’s published GPT-5.6 pricing is simple enough to summarize, but the interpretation is more interesting than the table.

Model	Input price	Output price	Positioning
GPT-5.6 Sol	$5 per 1M input tokens	$30 per 1M output tokens	Highest capability, hardest tasks
GPT-5.6 Terra	$2.50 per 1M input tokens	$15 per 1M output tokens	Balanced lower-cost option
GPT-5.6 Luna	$1 per 1M input tokens	$6 per 1M output tokens	Fast and affordable

The price ladder pushes builders toward routing. You probably should not run every request through Sol. You route cheap classification, extraction, and formatting to Luna. You route ordinary product intelligence and writing tasks to Terra. You reserve Sol for hard coding, hard reasoning, expensive business decisions, advanced agentic tasks, and sensitive analysis.

That is not just a cost trick. It is the new architecture pattern for AI products: model routing, eval-driven escalation, cache-aware prompting, and per-task quality budgets.
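As a rough illustration of that routing pattern, the sketch below maps task types to tiers and prices a request using the table above. The tier labels ("sol", "terra", "luna") are shorthand for this example rather than official API model identifiers, and the task categories and the default tier are assumptions you would replace with your own eval results.

```python
# Prices in USD per 1M tokens (input, output), taken from the pricing table above.
PRICES = {
    "sol": (5.00, 30.00),
    "terra": (2.50, 15.00),
    "luna": (1.00, 6.00),
}

# Hypothetical task categories; a real product would derive these from evals.
ROUTES = {
    "classification": "luna",
    "extraction": "luna",
    "formatting": "luna",
    "writing": "terra",
    "product_qa": "terra",
    "hard_coding": "sol",
    "hard_reasoning": "sol",
}


def route(task_kind):
    """Pick a tier for a task; unknown task kinds fall back to the middle tier."""
    return ROUTES.get(task_kind, "terra")


def request_cost_usd(tier, input_tokens, output_tokens):
    """Uncached cost of a single request on the given tier."""
    input_price, output_price = PRICES[tier]
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000


for kind in ("extraction", "writing", "hard_coding"):
    tier = route(kind)
    print(kind, tier, round(request_cost_usd(tier, 2_000, 500), 4))
```

With a 2,000-token input and a 500-token output, the same request costs five times more on Sol than on Luna, which is the whole argument for routing by difficulty.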

New Reasoning Modes: Max Reasoning and Ultra Mode

Sol’s max reasoning effort is a direct admission that some tasks need more thinking time. That is good news for hard problems and bad news for lazy benchmark takes. If a model has multiple effort levels, a single score can hide the actual tradeoff between latency, cost, and capability.

OpenAI makes a similar point in the system card: it reports some performance as curves across reasoning effort rather than one fixed number. That is the right direction. A frontier model is no longer a static text generator. It is a compute allocation system.

Ultra mode is the bigger conceptual shift. OpenAI says GPT-5.6 introduces ultra mode, which uses subagents for complex work. That sounds like a model-side version of what builders have been doing manually: break a hard task into subtasks, let specialized workers investigate pieces, then synthesize the result.

The upside is obvious. Long-horizon coding, research planning, vulnerability analysis, and multi-file debugging all benefit from decomposition. The risk is also obvious. More autonomy, more persistence, and more internal delegation can produce more capable work, but also more ways to go beyond a user’s intent. The system card’s discussion of agentic coding misalignment is relevant here: OpenAI says GPT-5.6 Sol took severity level 3 actions more often than GPT-5.5 in simulated internal agentic coding traffic, though absolute rates remained low.

That is the trade. More useful agents also need stronger boundaries.
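For readers who have not built the manual version of this pattern, here is a minimal Python sketch of builder-side decomposition: fan subtasks out to workers, then synthesize the results. It is not a description of how ultra mode works internally, which OpenAI has not detailed here. The `run_subtask` worker is a hypothetical stand-in for a model or tool call, and the function names are illustrative.

```python
from concurrent.futures import ThreadPoolExecutor


def run_subtask(subtask):
    """Hypothetical worker; a real one would call a model or a tool."""
    return f"findings for: {subtask}"


def decompose_and_synthesize(task, subtasks, worker=run_subtask, max_workers=4):
    """Run subtasks in parallel, then merge the results into one report."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map returns results in the same order as `subtasks`
        results = list(pool.map(worker, subtasks))
    lines = [f"Task: {task}"] + [f"- {result}" for result in results]
    return "\n".join(lines)


report = decompose_and_synthesize(
    "debug failing checkout flow",
    ["trace the API error", "inspect recent commits", "check the payment config"],
)
print(report)
```

The boundary question from the paragraph above shows up even in this toy version: each worker needs its own permissions and limits, because the synthesis step only sees what the workers chose to report.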

Coding Benchmarks and Terminal-Bench 2.1

OpenAI says GPT-5.6 Sol sets a new state of the art on Terminal-Bench 2.1, a coding and terminal-agent evaluation focused on realistic command-line work. The announcement does not amount to a complete, independently reproducible benchmark package, so the cautious reading is: OpenAI is claiming state-of-the-art performance, and builders should validate against their own repos before betting the roadmap on it.

Still, the direction is credible. The system card repeatedly frames Sol as stronger in coding and agentic workflows. It also says GPT-5.6 Sol and Terra improve meaningfully over GPT-5.5 and GPT-5.4 on internal research debugging tasks, which involve searching large codebases, inspecting experiments, and identifying likely causes of failures. OpenAI also reports strong Sol performance on kernel optimization and small-scale pretraining optimization tasks, while emphasizing that these do not demonstrate fully automated AI R&D.

For developers, the practical implication is not “fire your engineers.” Please do not build strategy from a LinkedIn fever dream. The practical implication is that Sol may be better at the annoying middle of engineering: tracing bugs across files, maintaining context, navigating terminals, spotting broken assumptions, and stitching multiple tool calls into a coherent plan.

Image (AI-generated): three-panel concept graphic of the coding, biology, and cybersecurity evaluation categories. Exact scores should be read from OpenAI’s cited materials rather than inferred from the art.

Biology Benchmarks and GeneBench v1

OpenAI says Sol improves on GeneBench v1 while using fewer tokens than GPT-5.5. The system card also classifies the GPT-5.6 models as High capability in Biological and Chemical risk under OpenAI’s Preparedness Framework.

That does not mean Sol is a push-button bioweapon machine. It means OpenAI believes the family is capable enough in biological and chemical domains to trigger stronger safeguards. The system card describes threat modeling around biological harm, including novice uplift and expert assistance pathways, and says OpenAI is preparing for potential future Critical capabilities involving dangerous novel threat vectors or full engineering/synthesis cycles without human intervention.

The nuance matters. Biology capability can be socially valuable. Better models can help with biomedical research, public-health preparedness, diagnostics, literature review, protocol interpretation, and lab troubleshooting. But biology is also dual-use. A model that helps legitimate researchers can also lower friction for harmful workflows if access and monitoring are sloppy.

OpenAI’s answer is not a simple block. The system card describes trusted access for biology research, where eligible organizations may receive scoped access to higher-risk dual-use outputs based on institutional verification, use-case review, accountability, and monitoring.

Cybersecurity Benchmarks: ExploitBench and ExploitGym

Cyber is where GPT-5.6 Sol gets spicy.

OpenAI says Sol is its most capable cybersecurity model yet. The system card says the GPT-5.6 models are a meaningful step up in cybersecurity capability, but do not reach the Preparedness Framework’s highest Critical level. It also says Sol and Terra can find vulnerabilities and build pieces of exploits, but did not autonomously carry out end-to-end attacks against hardened targets in testing.

OpenAI uses several cyber evaluations. The system card describes CTF challenges, CVE-Bench, VulnLMP, ExploitBench, ExploitGym, SEC-Bench Pro, and external testing by Irregular. The important distinction is between finding a vulnerability, building exploit primitives, and producing a full working exploit chain.

ExploitGym is especially important because it asks whether AI agents can turn known, reproducible vulnerabilities into working exploits that achieve concrete impact, such as code execution. The arXiv paper describes 898 instances from real-world vulnerabilities across userspace programs, Google’s V8 JavaScript engine, and the Linux kernel. Its authors report that exploitation remains difficult but frontier models can exploit a non-trivial fraction of vulnerabilities.

OpenAI says that on ExploitBench, Sol is competitive with Anthropic Mythos Preview while using roughly one third of the output tokens. The system card also says Sol, Terra, and Luna improve on ExploitGym as reasoning effort increases. That last phrase matters: cyber capability is not only model quality. It is model quality multiplied by reasoning effort, tool access, scaffolding, and persistence.

Preparedness Framework: High vs Critical Risk

OpenAI’s Preparedness Framework is the policy lens for this release. The system card says GPT-5.6 Sol, Terra, and Luna are treated as High capability in Cybersecurity and Biological/Chemical risk. It also says none of the models reach the High threshold in AI Self-Improvement.

High is not “business as usual.” It means the models are capable enough in a risk domain that OpenAI believes stronger safeguards are required before deployment. Critical is the more severe threshold. OpenAI says the GPT-5.6 models do not cross the Cyber Critical threshold.

The cyber Critical distinction is the one builders and policymakers will watch. OpenAI says Sol and Terra can find vulnerabilities and exploit primitives but did not autonomously produce full-chain exploits in Chromium and Firefox tests. That is reassuring, but only in the narrow sense that the tested systems did not show the model crossing that specific bar. It is not a permanent law of nature.

Capability curves move. Scaffolds improve. Tooling improves. Agents get more persistent. A model that cannot complete a full chain in one release may get closer when paired with better harnesses, longer context, more compute, better exploit libraries, or a narrower target.

Why the Cyber Results Matter

Cybersecurity is the cleanest example of AI’s dual-use problem. A model that can reason through vulnerabilities can help defenders harden systems. The same capability can help attackers.

OpenAI’s system card explicitly says broad access to cybersecurity capability can have safety benefits because GPT-5.6 appears better at finding and fixing vulnerabilities than exploiting them in real attacks. That is a defensible argument. If defenders get better tools before attackers get reliable full-chain automation, society gets a window to patch more software.

But that window is not guaranteed. If offensive capability improves faster than defensive adoption, the same models become a force multiplier for attackers. That is why Sol’s cyber story is not merely a benchmark story. It is a release-governance story.

The model is valuable because it can help secure systems. It is sensitive because it can reason about how systems fail. The right question is not “is cyber AI good or bad?” The right question is “who gets which capability, with what monitoring, and under what accountability?”

Safety Stack: Refusals, Classifiers, Monitoring, and Enforcement

OpenAI’s safety stack for GPT-5.6 is heavier than the old “the model refuses bad prompts” story.

The system card describes multiple layers: model-level safety training and refusals, real-time monitoring, topical classifiers, safety reasoners, activation classifiers for Sol and Terra, account-level review, trusted-access programs, and enforcement. The activation-classifier detail is especially interesting. OpenAI says Sol and Terra can use classifiers that monitor internal activation patterns during inference. If the system detects that the model may be about to generate harmful content, it can pause streaming, run a separate check, and block the generation if confirmed harmful.

That is not ordinary keyword filtering. It is closer to an internal early-warning system.

Image (AI-generated): layered safety stack concept, pairing model-level refusals with classifiers, monitoring, trusted access, and account-level enforcement.

The system card also says accounts reaching defined biological/chemical or cybersecurity risk thresholds may be escalated for deeper automated review and, in some cases, manual review. Depending on product surface and circumstances, OpenAI may apply additional monitoring, move an account into a more restrictive blocking configuration, prompt the user to apply for trusted access, restrict frontier bio or cyber capabilities, or suspend/ban the account.

This is where frontier AI starts looking less like SaaS and more like critical infrastructure. Access becomes dynamic. Trust status matters. Product surface matters. Use case matters. The account history matters.

Automated Red-Teaming and the 700,000 A100e GPU-Hour Claim

The system card says OpenAI dedicated more than 700,000 A100e GPU hours to automatically find universal jailbreaks, using techniques including optimization-based search, reinforcement learning, and test-time search. It also says automated red teaming will run continuously during deployment.

Translate that into normal language: OpenAI spent a very large amount of compute trying to find prompts or strategies that reliably break the safeguards across tasks. This is not a couple of researchers typing spicy prompts into a chat window. It is automated adversarial search at scale.

The result is not “jailbreaks are solved.” The system card says OpenAI found attacks, used them to inform mitigations, and drove one reported attack from a 10.0% success rate during internal red-teaming to 0% after additional mitigations. The sane reading is that safety is an ongoing adversarial process. The model changes, attackers adapt, mitigations improve, and the loop continues.

Government Involvement and Limited Preview

The limited preview is one of the most important parts of this release.

OpenAI says it previewed GPT-5.6 plans and capabilities with the U.S. government before launch, and that at the government’s request it started with a limited preview for a small group of trusted partners whose participation was shared with the government. That is a remarkable release posture.

It does not mean the government is writing OpenAI’s roadmap. It does mean frontier release strategy is now intertwined with public-sector security concerns. That is the new normal for models with advanced cyber and bio capabilities.

For AI companies, this should reset expectations. Launching a frontier model is no longer just a product, PR, and infra event. It is also a safety case, a policy case, a government-relations case, and an access-control case.

What GPT-5.6 Sol Means for Developers

For developers, Sol is a bet on harder work.

The obvious use cases are long-horizon coding, codebase migration, terminal debugging, security review, agentic coding in Codex, complex data analysis, research assistance, and multi-step workflow automation. The less obvious use case is model escalation. Many products will not call Sol by default. They will call Luna or Terra first, then escalate to Sol when the task is difficult, ambiguous, sensitive, or expensive to get wrong.
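The escalation idea can be sketched in a few lines of Python. This is a minimal illustration, assuming you supply two things yourself: `call_model`, a hypothetical function that sends a prompt to a given tier, and `accept`, a check (an eval, a validator, a confidence rule) that decides whether an answer is good enough. The tier labels are shorthand, not official model identifiers.

```python
def answer_with_escalation(prompt, call_model, accept, tiers=("luna", "terra", "sol")):
    """Try cheaper tiers first and escalate only when the answer is rejected.

    Returns (tier_used, answer). If no tier passes the check, the answer from
    the last tier is returned so the caller can decide what to do with it.
    """
    answer = None
    for tier in tiers:
        answer = call_model(tier, prompt)
        if accept(answer):
            return tier, answer
    return tiers[-1], answer


# Hypothetical stand-ins for a real client and a real quality check
def fake_call_model(tier, prompt):
    return "patched and tests pass" if tier == "sol" else "not sure"


def looks_confident(answer):
    return "not sure" not in answer


print(answer_with_escalation("fix the flaky test", fake_call_model, looks_confident))
```

The interesting engineering is in `accept`. A weak check escalates everything and erases the savings; an overly lenient one ships bad answers from the cheap tier.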

The cache changes also push developers toward better prompt architecture. Stable system instructions, tool schemas, repo maps, style guides, and retrieval context should be cache-friendly. Bad prompting is now not only a quality problem; it is a cost problem.
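One way to picture cache-friendly prompt architecture is as an ordering rule: stable content first, volatile content last. The Python sketch below only illustrates that ordering and uses a hash to check that the prefix stays byte-identical across requests. It does not show how cache breakpoints are declared in the API, which this article does not cover; the function and parameter names are hypothetical.

```python
import hashlib


def build_prompt(system, tool_schemas, repo_map, user_message):
    """Assemble a prompt with the stable parts first and the volatile part last.

    Returns (prompt, prefix_key). The key is a hash of the stable prefix, so a
    changing key across requests signals that the prefix is not cache-friendly.
    """
    # Sort tool schemas so their order never changes between requests
    stable_prefix = "\n\n".join([system, *sorted(tool_schemas), repo_map])
    prefix_key = hashlib.sha256(stable_prefix.encode("utf-8")).hexdigest()
    return stable_prefix + "\n\n" + user_message, prefix_key


prompt_a, key_a = build_prompt("You are a code reviewer.", ["tool: grep", "tool: diff"], "src/ tests/", "Review PR A")
prompt_b, key_b = build_prompt("You are a code reviewer.", ["tool: diff", "tool: grep"], "src/ tests/", "Review PR B")
print(key_a == key_b)
```

Timestamps, request IDs, or per-user details placed near the top of a prompt change the prefix on every call, which is the kind of bad prompting that now shows up on the bill.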

Image (AI-generated): developer workspace with a coding terminal, API dashboard, and agent workflow screens. GPT-5.6 Sol is best understood as part of an agentic coding, API, and workflow-routing stack.

Developers should also pay attention to monitoring and trust tiers. If your product touches cyber, bio, code execution, autonomous agents, or user data, the model’s behavior is only one layer. You also need user intent checks, audit logs, permission boundaries, rollback paths, model routing, and evals that match your real workflows.

What It Means for AI Companies and Startups

For startups, GPT-5.6 Sol is both an opportunity and a warning.

The opportunity is that small teams may get access to stronger coding and agentic capability without training a frontier model. That can compress product development timelines. A three-person startup with strong taste, good evals, and a sharp distribution wedge can do more.

The warning is dependency. If your product only works because one frontier model is available in one access tier, you do not have a product foundation. You have a dependency with a nice demo. Limited preview, trusted access, and domain-specific restrictions are not side notes. They are product constraints.

The winners will build model-agnostic infrastructure where it matters: routing, evals, prompt portability, fallback models, user-level safety IDs, logs, and cost controls. They will still use Sol where Sol is worth it. But they will not hard-code their company into one model’s release policy.
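A fallback chain is the simplest piece of that infrastructure to sketch. The Python example below assumes each provider is wrapped in a plain function that takes a prompt and either returns text or raises; the provider names and stub functions are hypothetical, and a production version would add timeouts, retries, and logging.

```python
def call_with_fallback(prompt, providers):
    """Try each (name, call) pair in order and return the first success.

    Raises RuntimeError listing every failure if all providers fail.
    """
    errors = []
    for name, call in providers:
        try:
            return name, call(prompt)
        except Exception as exc:  # keep going; record why this provider failed
            errors.append(f"{name}: {exc}")
    raise RuntimeError("all providers failed: " + "; ".join(errors))


# Hypothetical stand-ins for real client wrappers
def frontier_model(prompt):
    raise PermissionError("access tier not enabled")


def open_weight_model(prompt):
    return f"local answer to: {prompt}"


print(call_with_fallback("summarize the incident", [
    ("frontier", frontier_model),
    ("open-weight", open_weight_model),
]))
```

The stub failure is deliberate: with preview gating and trusted-access tiers, "access not enabled" is as realistic a failure mode as an outage.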

What It Means for Open Source Models

Open source and open-weight models do not suddenly become irrelevant because Sol is stronger. They become more strategically important.

Closed frontier models will likely lead on the hardest tasks. But controlled releases, safety tiers, government involvement, and high-sensitivity domains create demand for alternatives: local models, open-weight models, sovereign AI deployments, specialized fine-tunes, and private inference.

The future is not closed versus open. It is layered. Use closed frontier models for the hardest reasoning when access is available and appropriate. Use open models for resilience, privacy, cost control, regional requirements, and workflows that do not need flagship capability.

Kingy AI Take

GPT-5.6 Sol is not just another chatbot upgrade. It is a frontier agentic model with stronger long-horizon coding, biology, and cyber capability, paired with heavier safety infrastructure and a more controlled rollout.

The story is not only “new model is smarter.” The story is “frontier models are becoming powerful enough that release strategy, government coordination, and access control are now part of the product.”

That is the real shift. The model is impressive. The wrapper around the model is the warning label.

What Feels Proven

Several things feel solid based on OpenAI’s public materials.

First, GPT-5.6 is a real family strategy, not a one-model launch. Sol, Terra, and Luna give OpenAI a full capability/cost ladder.

Second, OpenAI is treating cyber and bio capabilities as serious deployment risks. The system card’s High classifications, trusted-access programs, monitoring stack, and red-team compute investment are not casual footnotes.

Third, agentic coding is becoming more powerful and more complicated. Sol appears meaningfully stronger in code-heavy workflows, but OpenAI’s own system card also reports increased persistence and some higher-severity misalignment signals in internal coding simulations.

Fourth, cyber evaluations are becoming central to frontier model launches. ExploitBench, ExploitGym, CVE-Bench, VulnLMP, and external testing are now part of the story because model capability is moving into operational security terrain.

What Feels Unproven

Several things remain unproven in public.

We do not yet know how Sol performs across ordinary company codebases, messy product repos, legacy stacks, enterprise permissions, and real incident-response workflows. Benchmarks are useful. Your repo is meaner.

We do not yet know whether ultra mode will be reliable enough for routine production use, how predictable subagent behavior will be, or how developers should price and monitor it.

We do not yet know whether the limited preview will become a short safety ramp or a more durable access pattern for the most capable models.

We do not yet know whether OpenAI’s safety stack will hold up under public-scale adversarial pressure. The 700,000 A100e GPU-hour red-team effort is serious. The internet is also undefeated at finding weird edges.

Final Verdict

OpenAI GPT-5.6 Sol looks like a major frontier-model release, especially for coding, cyber, biology-adjacent research, and agentic workflows. The pricing is aggressive enough to make routing mandatory. The model family is structured enough to support real product architecture. The system card is detailed enough to make clear that OpenAI is not treating this as a normal model launch.

But the most important thing about Sol may not be raw intelligence. It may be the release pattern.

Limited preview. Trusted partners. U.S. government preview. High cyber and bio capability classifications. Activation classifiers. Account-level enforcement. Trusted access. Continuous automated red teaming.

That is what frontier AI now looks like: capability plus control.

For builders, the move is simple. Test Sol hard when you get access. Route intelligently. Cache deliberately. Keep fallback models. Build evals. Treat cyber and bio boundaries with respect. And pay attention to release governance, because the next frontier model race will not be won only by the smartest model. It will be won by the model that can be deployed, trusted, monitored, priced, and governed at scale.

FAQ

What is GPT-5.6 Sol?

GPT-5.6 Sol is OpenAI’s flagship model in the GPT-5.6 family. OpenAI positions it as the highest-capability option for hard reasoning, coding, agentic, cyber, and advanced research workflows.

What are GPT-5.6 Terra and GPT-5.6 Luna?

Terra is the balanced lower-cost GPT-5.6 model, while Luna is the fastest and most cost-efficient model in the family. OpenAI says Terra is competitive with GPT-5.5 while being 2x cheaper, and Luna offers strong capability at the lowest GPT-5.6 cost.

How much does GPT-5.6 Sol cost?

OpenAI lists GPT-5.6 Sol at $5 per 1M input tokens and $30 per 1M output tokens. Terra is $2.50 input and $15 output per 1M tokens. Luna is $1 input and $6 output per 1M tokens.

Is GPT-5.6 Sol available in ChatGPT?

At launch, OpenAI says GPT-5.6 models are in limited preview through API and Codex for select trusted partners, with broader availability for ChatGPT, Codex, and API coming soon.

Does GPT-5.6 Sol cross OpenAI’s Cyber Critical threshold?

No. OpenAI’s system card says GPT-5.6 models are High capability in Cybersecurity, but do not reach the Cyber Critical threshold. OpenAI says Sol and Terra can find vulnerabilities and exploit primitives but did not autonomously produce full-chain exploits in Chromium and Firefox tests.

Why is ExploitGym important?

ExploitGym is a benchmark for testing whether AI agents can turn known vulnerabilities into working exploits that achieve concrete security impact. It matters because it measures a more operational cyber capability than simply finding or describing a bug.

What is ultra mode?

OpenAI says GPT-5.6 introduces ultra mode, which uses subagents for complex work. That suggests a more agentic execution pattern where the system decomposes hard tasks rather than producing a single flat response.

What should developers do first?

Developers should build evals around their own workflows, route tasks across Sol, Terra, and Luna by difficulty, use cache-friendly prompt architecture, and avoid making one frontier model their only production dependency.


OpenAI GPT-5.6 Sol: Benchmarks, Specs, Pricing, Safety Evals, and What This Model Really Means

Curtis Pyke
