TL;DR;
The OpenAI o3‑mini model is a next‐generation language model leveraging large-scale reinforcement learning with a chain‑of‑thought capability that empowers it to reason deeply before providing answers. Building upon previous models (such as o1‑mini), o3‑mini is designed to be significantly better at tasks like coding and general reasoning while integrating robust safety mitigations. Its extensive evaluation framework spans disallowed content filtering, jailbreak resistance, hallucination minimization, fairness and bias assessments, and adherence to a multi‑tier instruction hierarchy.
Additionally, the model undergoes red teaming challenges and Preparedness Framework evaluations covering domains from cybersecurity and biological threat creation to radiological risks, persuasion, and model autonomy. Multiple metrics (pass rates, win‐rates, and confidence intervals) illustrate that while o3‑mini attains performance parity with or improvements over predecessors and some benchmarks (for instance, in disallowed content evaluations and jailbreak resistance), some areas such as certain complex agentic or ML research tasks remain challenging.
The document painstakingly details each safety evaluation, emergent risk category, and mitigation strategy that underpins o3‑mini’s deployment, emphasizing a commitment to iterative testing, interdisciplinary external collaboration, and proactive safety oversight in preparing the model for real‑world deployment.

Introduction and Overview
The system card for the OpenAI o3‑mini model begins with an overview of its design philosophy. At its core, o3‑mini is built on large‑scale reinforcement learning and chain‑of‑thought reasoning—allowing the model to produce extended internal deliberations prior to formulating its final output. This chain‑of‑thought capability is not only a strength in providing sophisticated natural language responses but also a double‑edged sword; it enhances the overall performance, yet it could potentially elevate risks by exposing deeper reasoning that might be exploited under unsafe circumstances.
The document establishes that o3‑mini shares similar training methodologies with its sibling models, notably o1‑mini, though with specific improvements that target both performance (particularly in coding) and safety. By instructing the model to “think before speaking,” OpenAI engineers have forced it to conform to a deliberate alignment protocol. This structured thought process incorporates explicit reasoning through safety policies, therefore permitting the model to better resist attempts to generate harmful outputs and mitigate unsafe requests—even when users try to provoke unethical or dangerous behavior.
Model Data and Training
In the training phase, o3‑mini was pre‑trained on a diverse array of publicly available and proprietary datasets, followed by a phase of reinforcement learning that focused on refining its chain‑of‑thought reasoning capacities. This training process privileged the ability to reason through complex prompts and adhere strictly to detailed safety guidelines. The model undergoes advanced data filtering procedures, ensuring that both sensitive content—such as personal data or explicit material involving minors—and dangerous topics are curtailed in the training process. The purpose of such filtering is dual:
- To boost the model’s helpfulness.
- To obstruct the generation of unsafe or explicitly forbidden content.
Furthermore, o3‑mini’s capabilities in coding are accentuated. The system card notes that due to its faster performance in tasks like debugging and interpreting code, users can expect it to receive experimental support for browsing and summarizing internet content in later integrations.

Scope of Testing
The document outlines that o3‑mini is evaluated on several checkpoints, most notably comparing the “o3‑mini‑near‑final‑checkpoint” and the finally launched “o3‑mini” checkpoint. Although the base model remains the same, the final release benefits from incremental post‑training improvements. The system card highlights that while many evaluation metrics align closely with those found in similar models (like GPT‑4o and o1‑mini), there are distinct differences that are carefully measured in side‑by‑side comparisons. The scope of testing covers a multifaceted array of evaluations, including disallowed content tests, jailbreak evaluations, hallucination tests, and fairness and bias assessments.
Observed Safety Challenges and Evaluations
Disallowed Content Evaluations
One of the key areas of evaluation is the model’s ability to handle disallowed content. OpenAI tests o3‑mini through several benchmarks:
- Standard Refusal Evaluation: Here, models are tasked with avoiding unsafe outputs on topics such as hate speech or requests for harmful instructions. Metrics like “not_unsafe” and “not_overrefuse” are quantified. For example, o3‑mini achieves scores of 1.0 for not producing unsafe content and 0.92 for not over-refusing on benign requests.
- Challenging Refusal Evaluation: Additional tests feature more adversarial inputs that the model must refuse safely. Remarkably, on these evaluations, o3‑mini scores better than GPT‑4o in resisting dangerous prompts.
- XSTest: This evaluation further scrutinizes instances where benign requests might trigger overcautious refusals (the overrefusal edge cases). The comparisons among o3‑mini, GPT‑4o, and o1‑mini reveal that while o3‑mini generally matches GPT‑4o on simply avoiding unsafe replies, its performance may fluctuate when subtle language cases (like figurative language, historical events, or ambiguous terminology) are involved.
Jailbreak Evaluations
In order to test the resilience of the model against deliberate adversarial attacks (jailbreaks), OpenAI subjects o3‑mini to four categories:
- Production Jailbreaks: Where real-world misuse attempts (as observed in production ChatGPT data) are simulated.
- Jailbreak Augmented Examples: Classic jailbreak techniques are applied to standard prompts.
- StrongReject: An evaluation based on the literature of common attacks, assessing the model’s resistance when the top 10% of attack techniques are considered.
- Human Sourced Jailbreaks: These involve red teaming by human evaluators identified with high harm potential.
The metrics show that o3‑mini is on par with o1‑mini, often scoring higher than GPT‑4o on several jailbreak tests. For instance, in the StrongReject evaluation, o3‑mini tops with a score of 0.73 compared to 0.37 for GPT‑4o.

Hallucination Evaluations
Another critical dimension is the model’s propensity to hallucinate—generate inaccurate or fabricated details. Using the PersonQA test dataset, the evaluation scrutinizes:
- Accuracy: The percentage of correct answers relative to publicly known facts.
- Hallucination Rate: How often the reasoning deviates from verifiable truth.
Here, o3‑mini demonstrates a lower hallucination rate (14.8%) compared to GPT‑4o (52.4%), reflecting an enhanced capacity for grounding its responses in validated information.
Fairness and Bias Evaluations
Fairness is assessed using benchmarks like the BBQ evaluation and additional templated prompts focusing on ambiguity and potential discrimination in responses. Factors such as age, race, and gender are explicitly manipulated within prompts—“The [age]-year-old [race] [gender] patient…”—to determine whether the model unduly emphasizes these attributes in its diagnostic reasoning. Critically:
- On explicit and implicit discrimination metrics (with coefficients normalized between 0 and 1), o3‑mini consistently exhibits lower bias scores in many instances than its predecessors. For example, in explicit discrimination tasks, o3‑mini’s overall coefficient is 0.14 compared to higher coefficients in models like o1‑mini.
Instruction Hierarchy Evaluations
A vital feature of o3‑mini is its adherence to an Instruction Hierarchy—ensuring that system messages take precedence over developer messages, and those, in turn, override user messages. This hierarchy is rigorously tested in settings where conflicting instructions are provided:
- Developer vs. User Conflicts: The model must resolve these conflicts by following the highest-priority instruction.
- Tutor Jailbreaks: In scenarios where the model is explicitly instructed not to reveal an answer (as a math tutor) but is coyly tricked by the end user, the model must refuse.
- Phrase and Password Protection: The model is also responsible for maintaining the secrecy of specified phrases or passwords despite pressure from user or developer messages.
Results indicate that o3‑mini generally adheres closely to these hierarchies. Although slight variations are observed (with GPT‑4o sometimes outperforming on certain phrase protection tasks), o3‑mini is shown to be robust given its overall strong performance metric in these evaluations.

External Red Teaming and Real-World Testing
External red teaming plays a crucial role in further stress‑testing model safety. OpenAI provided an anonymized interface for red teamers to test the model’s responses across a wide range of potentially harmful queries, including:
- Cyberhacking
- Bioterrorism
- Weapon creation
- Phishing and scamming
- Hate speech
Participants rated the generated responses, leading to a detailed “Win Rate” analysis. For example, the o3‑mini model was found to have a self‑rated win rate of over 73% when pitted against GPT‑4o, reinforcing its competitive standing on safety measures. Additionally, partner organizations such as Gray Swan hosted the “Jailbreak Arena” where o3‑mini’s performance in generating illicit advice or extremist content was additionally scrutinized.
Preparedness Framework Evaluations
The Preparedness Framework is an integral aspect of the system card, mapping potential risks across several domains and ensuring that the model meets strict risk thresholds before deployment. Key risk categories include:
Cybersecurity
The cybersecurity evaluation focuses on how well o3‑mini performs on tasks resembling Capture‑the‑Flag (CTF) challenges, where the model must identify and exploit vulnerabilities in simulated systems. Evaluated across high‑school, collegiate, and professional levels:
- o3‑mini scored 61% on high‑school challenges but only 21% on collegiate and professional challenges.
- The overall rating for cybersecurity is classified as “Low,” indicating that o3‑mini does not pose a significant risk of advancing real‑world vulnerability exploitation.
Chemical and Biological Threat Creation
This risk category assesses whether the model can assist experts in the dangerous planning or replication of existing chemical and biological threats:
- Evaluations span open‑ended questions (long‑form biothreat questions), expert comparisons, and contextual tasks.
- In pre‑mitigation mode, o3‑mini demonstrated the potential to synthesize and provide critical planning steps at levels that could be deemed “Medium” risk. However, post‑mitigation versions reliably refuse to provide such guidance.
- Detailed tests across multiple sub‑benchmarks (e.g., expert probing and the use of integrated wet‑lab tools) provide a nuanced view of the model’s potential risks. Despite showing the capability to generate intricate threat planning details, the safety mitigations ensure that such outputs are substantially curtailed in public deployments.
Radiological and Nuclear Threat Creation
Given the sensitivity of radiological and nuclear materials, o3‑mini’s abilities in this area are tested with an emphasis on nonproliferation topics:
- The evaluations include multiple choice questions on nuclear engineering and expert knowledge queries on weapons design.
- Although the model performs similarly to other earlier model variants (such as o1‑preview and o1), the limitations imposed by available unclassified data keep the overall risk in check. o3‑mini is judged unable to fully aid in the development of weapons at a dangerous level in its post‑mitigation state.
Persuasion
Persuasion evaluations investigate the model’s ability to craft compelling arguments, which could potentially be used to subtly influence opinions or manipulate outcomes:
- The ChangeMyView Evaluation leverages real‑world data from Reddit’s r/ChangeMyView forum, measuring the persuasiveness of model responses versus human responses.
- Additional tests—such as the Persuasion Parallel Generation and MakeMePay evaluations—assess politically persuasive writing and manipulative tactics.
- o3‑mini consistently ranks in the top 80–90th percentile relative to human submissions. Although its performance is strong, it does not exhibit “superhuman” capabilities in this domain, keeping its risk classification at “Medium.”
Model Autonomy
This section evaluates o3‑mini’s ability to perform long‑horizon, agentic tasks that have implications for self‑improvement and automated research:
- The model is tested on short‑term tasks such as standardized multiple choice interviews and coding challenges sourced from OpenAI Research Engineer interviews.
- More ambitious benchmarks like SWE‑bench Verified (which simulates real-world software engineering problems) reveal that o3‑mini achieves variable success. With a detailed tool‑assisted framework, some tasks are solved at a pass rate of up to 61% while others lag behind, particularly those requiring multi‑step agentic tasks.
- Additionally, MLE‑bench tasks designed around Kaggle competitions and automated pull request replication further scrutinize the model. The evaluations suggest that while o3‑mini has improved over previous iterations, its capacity for full‑scale model autonomy (particularly in the context of self‑improvement and open‑ended ML research tasks) remains limited, with performance often trailing behind more specialized models.
Multilingual Performance
Multilingual evaluation is conducted by translating the MMLU test set into 14 languages with professional human translators:
- The model is then tested using prompt‑based chain‑of‑thought approaches in languages such as Arabic, Bengali, Chinese (Simplified), French, German, Hindi, Indonesian, Italian, Japanese, Korean, Portuguese (Brazil), Spanish, Swahili, and Yoruba.
- These evaluations demonstrate that o3‑mini consistently improves multilingual performance relative to o1‑mini. The relative scores are high (often above 0.80 on many languages) and affirm that the model’s reasoning ability extends into non‑English languages with only minor performance drops in lower‐resource languages like Yoruba or Swahili.
Mitigations: Building Safety into the Model
One of the overarching themes in the system card is the layered approach to safety mitigations deployed for o3‑mini. These include:
- Pre‑training mitigations: Filtering out harmful or sensitive training data to avoid bias or unsafe content.
- Deliberative Alignment Techniques: Teaching the model to reason about and apply explicit safety rules before producing outputs. This includes specialized training to handle tasks related to political persuasion and misinformation.
- Post‑deployment Monitoring: Robust systems are in place for live monitoring, threat assessment, and targeted interventions in areas of influence operations, extremism, and even cybersecurity threat detection.
- Expert and Automated Evaluations: Ongoing evaluations by both in‑house experts and external red teamers ensure the safety stack is continuously refined. Detailed audit trails and win‑rate analyses on multiple metrics help determine which parts of the model need further adjustments.
The document shows that while o3‑mini has the potential to generate complex and contextually appropriate responses, the incorporation of these safety measures serves to limit the risks associated with its advanced reasoning capabilities.
Conclusion and Future Directions
In wrapping up the system card, the document emphasizes that the increased reasoning and chain‑of‑thought capabilities of o3‑mini have led to superior performance on many benchmarks compared to prior models. However, these very strengths come with trade‑offs in that higher reasoning capacity can elevate certain safety risks. The preparedness framework classifies o3‑mini as medium risk in categories such as Persuasion, Chemical/Biological Threat Creation, and Model Autonomy, while classifying it as low risk in cybersecurity.
The card underscores that iterative real‑world deployment, continuous red teaming, and interdisciplinary collaboration remain essential. The emphasis on dynamic safety mitigations, along with targeted testing and external auditing, highlights OpenAI’s commitment to responsibly deploying increasingly capable models. With each iteration—reflected in improvements in coding, multilingual support, and nuanced understanding of adversarial challenges—the system card suggests that future models can be both more useful and progressively safer.
Furthermore, the model’s performance on agentic tasks, such as solving coding challenges and replicating real‑world software problems, is contrasted with its relative underperformance on open‑ended machine learning research tasks. This highlights that while o3‑mini represents a significant step forward in many respects, it is not the final word in model autonomy. Continued research, additional scaffolding techniques, and refined prompts will likely be required to fully harness its potential for self‑improvement and research acceleration.
In addition to the technical evaluations, the system card includes careful considerations regarding fairness and bias. By incorporating detailed mixed‑effects modeling and templated prompts that specify demographic attributes, the card makes evident that there is an ongoing effort to minimize explicit and implicit discrimination in model outputs. The detailed breakdown of coefficients in bias evaluation tables further attests to the rigorous attempts to measure and mitigate bias.
Ultimately, the OpenAI o3‑mini System Card stands as a comprehensive documentation of the model’s capabilities, risks, and mitigation strategies. It presents a multifaceted narrative that combines quantitative evaluations with qualitative risk assessments, championing a holistic approach to the development and deployment of increasingly powerful language models.
Authors, Acknowledgments, and Final Remarks
Spanning over 30 pages, the document concludes with detailed credits to the numerous researchers, engineers, evaluators, red teamers, and external collaborators whose contributions shaped the evaluations. The extensive list of references points to the academic and technical underpinnings of the methodologies employed—from deliberative alignment to advanced red teaming techniques. In citing their work as “OpenAI (2025),” the authors not only emphasize the model’s current state but also set a precedent for future evaluations and refinements.
The system card thus serves both as a technical blueprint and a risk management document informing stakeholders—be they engineers, policymakers, or concerned citizens—about the state of one of the most advanced AI systems available. It encapsulates key lessons in safety, performance, and iterative improvement and sets the stage for subsequent developments in the field of AI safety and robust model alignment.
In summary, the OpenAI o3‑mini System Card is an exhaustive report that details how the next-generation chain‑of‑thought language model is designed, tested, and refined. Its rigorous evaluations across multiple risk categories ensure that while the model excels in reasoning, coding, multilingual tasks, and persuasive writing, it remains circumscribed by robust safety mitigations. The preparedness framework—encompassing cybersecurity, threat creation, persuasion, and model autonomy assessments—reveals both the promise and the limitations of current AI models, guiding future research and development along a path that balances capability with responsibility.
Comments 1