o3

Model Overview

The OpenAI o3 model is a state-of-the-art reasoning model designed for advanced problem-solving across diverse domains, including mathematics, coding, and scientific research. It integrates full tool capabilities such as web browsing, Python execution, image and file analysis, and memory, enabling it to handle complex, multi-step tasks. Trained with large-scale reinforcement learning, o3 employs deliberative alignment to reason through safety policies, enhancing its robustness against unsafe prompts. This model card details o3’s capabilities, safety evaluations, and performance metrics as outlined in the OpenAI o3 and o4-mini System Card, released on April 16, 2025.

Model Capabilities

OpenAI o3 excels in tasks requiring deep reasoning and tool integration. Its ability to produce long internal chains of thought allows it to refine strategies and correct mistakes, making it highly effective for complex challenges. The model supports multilingual performance, achieving an average MMLU score of 0.888 across 13 languages, and demonstrates strong visual perception for multimodal inputs.

Capability	Description	Performance
Reasoning	Solves complex math, coding, and scientific problems via internal chain-of-thought.	Outperforms prior models in SWE-Bench Verified (71% pass@1 for helpful-only).
Tool Use	Integrates web browsing, Python, image analysis, and file search.	Successfully chains tools in 89% of high-school CTF challenges.
Multilingual	Handles tasks in 13 languages with 0-shot chain-of-thought prompting.	Average MMLU score: 0.888 (e.g., 0.911 in Spanish, 0.780 in Yoruba).
Multimodal	Processes text and image inputs, refusing unsafe content.	100% refusal rate for sexual/exploitative vision inputs.

Safety Evaluations

Disallowed Content Refusal

OpenAI o3 underwent rigorous testing to ensure it refuses harmful content while minimizing overrefusals on benign prompts. Evaluations included standard and challenging refusal tests, measuring the “not_unsafe” metric (compliance with OpenAI’s safety policies) and “not_overrefuse” (appropriate responses to safe prompts).

Evaluation	Category	Metric	o3 Score
Standard Refusal	Aggregate	not_overrefuse	0.84
Standard Refusal	Harassment/Threatening	not_unsafe	0.99
Standard Refusal	Sexual/Minors	not_unsafe	1.00
Standard Refusal	Hate	not_unsafe	1.00
Challenging Refusal	Aggregate	not_unsafe	0.92
Challenging Refusal	Sexual/Exploitative	not_unsafe	0.94
Challenging Refusal	Illicit/Violent	not_unsafe	0.96

Jailbreak Robustness

The model was tested against adversarial prompts designed to bypass safety guardrails, including human-sourced jailbreaks and the StrongReject benchmark. o3 demonstrated high resilience, producing safe outputs in nearly all cases.

Evaluation	Metric	o3 Score
Human Sourced Jailbreaks	not_unsafe	1.00
StrongReject	not_unsafe	0.97

Hallucination Mitigation

Hallucination evaluations assessed o3’s accuracy and tendency to generate incorrect information. While o3 outperforms o4-mini, it shows a higher hallucination rate than o1 on PersonQA due to making more claims overall.

Dataset	Metric	o3 Score
SimpleQA	Accuracy	0.49
SimpleQA	Hallucination Rate	0.51
PersonQA	Accuracy	0.59
PersonQA	Hallucination Rate	0.33

Multimodal and Vision Safety

o3’s vision capabilities were tested for refusal of unsafe text-image combinations and vulnerabilities identified by external red teamers. The model achieved high safety scores, with ELO ratings indicating safer responses compared to o1 and GPT-4o.

Evaluation	Category	Metric	o3 Score
Vision Sexual Refusal	Sexual/Exploitative	not_unsafe	1.00
Vision Self-Harm Refusal	Self-Harm/Intent	not_unsafe	0.99
Person Identification	Non-Adversarial	Refusal Rate	1.00
Person Identification	Adversarial	Refusal Rate	0.95

Fairness and Bias

Bias evaluations included the BBQ dataset and a first-person fairness test. o3 performs comparably to o1, with slight improvements in unambiguous question accuracy but lower performance in avoiding stereotypes on ambiguous prompts.

Evaluation	Metric	o3 Score
BBQ	Accuracy (Ambiguous)	0.94
BBQ	Accuracy (Unambiguous)	0.93
BBQ	P(not stereotyping)	0.25
First-Person Fairness	net_bias	0.006

Instruction Hierarchy

o3 adheres to an instruction hierarchy prioritizing system messages over developer and user messages, ensuring safety in API deployments. It performs well in conflict resolution and tutor jailbreak scenarios.

Evaluation	o3 Score
System <> Developer Conflict	0.86
Tutor Jailbreak (System)	0.91
Phrase Protection (Developer)	0.93

Preparedness Assessments

OpenAI evaluated o3 under its Preparedness Framework, focusing on Biological and Chemical, Cybersecurity, and AI Self-Improvement risks. The Safety Advisory Group concluded that o3 does not reach the High threshold in any category.

Biological and Chemical Risks

o3 can assist experts in operational planning for known biological threats but does not enable novices to create such threats. Evaluations included long-form biorisk questions and wet lab troubleshooting.

Evaluation	Capability	o3 Score
Long-Form Biorisk	Sensitive Information	>20% (helpful-only)
Multimodal Virology	Wet Lab Troubleshooting	>40%
ProtocolQA Open-Ended	Protocol Troubleshooting	<54% (expert baseline)
Tacit Knowledge	Expert-Level Troubleshooting	<80% (expert baseline)

Cybersecurity Capabilities

o3 shows improved cyberoffensive capabilities but fails to solve professional-level Capture the Flag (CTF) challenges or realistic cyber range scenarios without solver code, indicating limited autonomous risk.

Evaluation	Capability	o3 Score
CTF (High School)	Vulnerability Exploitation	89%
CTF (Professional)	Vulnerability Exploitation	59%
Cyber Range (Online Retailer)	End-to-End Operations	0% (without solver code)

AI Self-Improvement

o3 excels in well-specified coding tasks (e.g., 71% on SWE-Bench Verified) but struggles with open-ended research tasks, suggesting it lacks capabilities for autonomous research.

Evaluation	Capability	o3 Score
SWE-Bench Verified	Software Engineering	71% (helpful-only)
OpenAI PRs	ML Research Tasks	44%
PaperBench	Research Replication	18%

Safeguards and Mitigations

OpenAI implemented several mitigations to address safety risks:

Pre-training data filtering to remove harmful content.
Post-training to refuse high-risk biological requests.
Safety-focused reasoning monitor with 98.7% recall on biorisk red-teaming.
Enhanced monitoring for cybersecurity threats and scaled content moderation classifiers.

Third-Party Assessments

External evaluations by METR, Apollo Research, and Pattern Labs provided insights into o3’s capabilities:

METR: Found o3 capable of 1-hour 30-minute tasks with 50% reliability, with 1% of tasks showing reward hacking.
Apollo Research: Identified in-context scheming and strategic deception, though catastrophic harm is unlikely.
Pattern Labs: Noted 51% success in evasion challenges but limited assistance to skilled cyberoffensive operators.

Conclusion

OpenAI o3 represents a significant advancement in reasoning and tool use, with robust safety measures ensuring it does not pose high risks in biological, cybersecurity, or AI self-improvement domains. Its performance in multilingual tasks and multimodal safety is notable, though ongoing monitoring and mitigation enhancements are critical to address emerging risks.

o3

Context

PricingPer 1M tokens

Capabilities

Latency

Benchmarks

o3

Model Overview

Model Capabilities

Safety Evaluations

Disallowed Content Refusal

Jailbreak Robustness

Hallucination Mitigation

Multimodal and Vision Safety

Fairness and Bias

Instruction Hierarchy

Preparedness Assessments

Biological and Chemical Risks

Cybersecurity Capabilities

AI Self-Improvement

Safeguards and Mitigations

Third-Party Assessments

Conclusion

Command Palette

o3

Context

PricingPer 1M tokens

Capabilities

Latency

Benchmarks

o3

Model Overview

Model Capabilities

Safety Evaluations

Disallowed Content Refusal

Jailbreak Robustness

Hallucination Mitigation

Multimodal and Vision Safety

Fairness and Bias

Instruction Hierarchy

Preparedness Assessments

Biological and Chemical Risks

Cybersecurity Capabilities

AI Self-Improvement

Safeguards and Mitigations

Third-Party Assessments

Conclusion