Command Palette

Search for a command to run...

o3

OpenAI

Frontier ModelReasoningVision ModelProprietary

Context

Release Date
Apr 16, 2025
Knowledge Cutoff
May 31, 2024
Window
128k

PricingPer 1M tokens

Input
$2
Output
$8
Blended 3:1
$3.5

Capabilities

Speed
125 t/s
Input
Output
Reasoning tokens

Latency

TTFT
20.59 ms
500 token response
24.58 s

Benchmarks

Reasoning
●●●○○
Math
●●●●●
Coding
●●●○○
MMLU Pro
85.3%
GPQA
82.7%
HLE
20.0%
SciCode
41.0%
AIME
90.3%
MATH 500
99.2%
LiveCodeBench
78.4%
HumanEval
99.1%

o3

  Model Overview

The OpenAI o3 model is a state-of-the-art reasoning model designed for advanced problem-solving across diverse domains, including mathematics, coding, and scientific research. It integrates full tool capabilities such as web browsing, Python execution, image and file analysis, and memory, enabling it to handle complex, multi-step tasks. Trained with large-scale reinforcement learning, o3 employs deliberative alignment to reason through safety policies, enhancing its robustness against unsafe prompts. This model card details o3’s capabilities, safety evaluations, and performance metrics as outlined in the OpenAI o3 and o4-mini System Card, released on April 16, 2025.

  Model Capabilities

OpenAI o3 excels in tasks requiring deep reasoning and tool integration. Its ability to produce long internal chains of thought allows it to refine strategies and correct mistakes, making it highly effective for complex challenges. The model supports multilingual performance, achieving an average MMLU score of 0.888 across 13 languages, and demonstrates strong visual perception for multimodal inputs.

CapabilityDescriptionPerformance
ReasoningSolves complex math, coding, and scientific problems via internal chain-of-thought.Outperforms prior models in SWE-Bench Verified (71% pass@1 for helpful-only).
Tool UseIntegrates web browsing, Python, image analysis, and file search.Successfully chains tools in 89% of high-school CTF challenges.
MultilingualHandles tasks in 13 languages with 0-shot chain-of-thought prompting.Average MMLU score: 0.888 (e.g., 0.911 in Spanish, 0.780 in Yoruba).
MultimodalProcesses text and image inputs, refusing unsafe content.100% refusal rate for sexual/exploitative vision inputs.

  Safety Evaluations

    Disallowed Content Refusal

OpenAI o3 underwent rigorous testing to ensure it refuses harmful content while minimizing overrefusals on benign prompts. Evaluations included standard and challenging refusal tests, measuring the “not_unsafe” metric (compliance with OpenAI’s safety policies) and “not_overrefuse” (appropriate responses to safe prompts).

EvaluationCategoryMetrico3 Score
Standard RefusalAggregatenot_overrefuse0.84
Standard RefusalHarassment/Threateningnot_unsafe0.99
Standard RefusalSexual/Minorsnot_unsafe1.00
Standard RefusalHatenot_unsafe1.00
Challenging RefusalAggregatenot_unsafe0.92
Challenging RefusalSexual/Exploitativenot_unsafe0.94
Challenging RefusalIllicit/Violentnot_unsafe0.96

    Jailbreak Robustness

The model was tested against adversarial prompts designed to bypass safety guardrails, including human-sourced jailbreaks and the StrongReject benchmark. o3 demonstrated high resilience, producing safe outputs in nearly all cases.

EvaluationMetrico3 Score
Human Sourced Jailbreaksnot_unsafe1.00
StrongRejectnot_unsafe0.97

    Hallucination Mitigation

Hallucination evaluations assessed o3’s accuracy and tendency to generate incorrect information. While o3 outperforms o4-mini, it shows a higher hallucination rate than o1 on PersonQA due to making more claims overall.

DatasetMetrico3 Score
SimpleQAAccuracy0.49
SimpleQAHallucination Rate0.51
PersonQAAccuracy0.59
PersonQAHallucination Rate0.33

    Multimodal and Vision Safety

o3’s vision capabilities were tested for refusal of unsafe text-image combinations and vulnerabilities identified by external red teamers. The model achieved high safety scores, with ELO ratings indicating safer responses compared to o1 and GPT-4o.

EvaluationCategoryMetrico3 Score
Vision Sexual RefusalSexual/Exploitativenot_unsafe1.00
Vision Self-Harm RefusalSelf-Harm/Intentnot_unsafe0.99
Person IdentificationNon-AdversarialRefusal Rate1.00
Person IdentificationAdversarialRefusal Rate0.95

    Fairness and Bias

Bias evaluations included the BBQ dataset and a first-person fairness test. o3 performs comparably to o1, with slight improvements in unambiguous question accuracy but lower performance in avoiding stereotypes on ambiguous prompts.

EvaluationMetrico3 Score
BBQAccuracy (Ambiguous)0.94
BBQAccuracy (Unambiguous)0.93
BBQP(not stereotyping)0.25
First-Person Fairnessnet_bias0.006

    Instruction Hierarchy

o3 adheres to an instruction hierarchy prioritizing system messages over developer and user messages, ensuring safety in API deployments. It performs well in conflict resolution and tutor jailbreak scenarios.

Evaluationo3 Score
System <> Developer Conflict0.86
Tutor Jailbreak (System)0.91
Phrase Protection (Developer)0.93

  Preparedness Assessments

OpenAI evaluated o3 under its Preparedness Framework, focusing on Biological and Chemical, Cybersecurity, and AI Self-Improvement risks. The Safety Advisory Group concluded that o3 does not reach the High threshold in any category.

    Biological and Chemical Risks

o3 can assist experts in operational planning for known biological threats but does not enable novices to create such threats. Evaluations included long-form biorisk questions and wet lab troubleshooting.

EvaluationCapabilityo3 Score
Long-Form BioriskSensitive Information>20% (helpful-only)
Multimodal VirologyWet Lab Troubleshooting>40%
ProtocolQA Open-EndedProtocol Troubleshooting<54% (expert baseline)
Tacit KnowledgeExpert-Level Troubleshooting<80% (expert baseline)

    Cybersecurity Capabilities

o3 shows improved cyberoffensive capabilities but fails to solve professional-level Capture the Flag (CTF) challenges or realistic cyber range scenarios without solver code, indicating limited autonomous risk.

EvaluationCapabilityo3 Score
CTF (High School)Vulnerability Exploitation89%
CTF (Professional)Vulnerability Exploitation59%
Cyber Range (Online Retailer)End-to-End Operations0% (without solver code)

    AI Self-Improvement

o3 excels in well-specified coding tasks (e.g., 71% on SWE-Bench Verified) but struggles with open-ended research tasks, suggesting it lacks capabilities for autonomous research.

EvaluationCapabilityo3 Score
SWE-Bench VerifiedSoftware Engineering71% (helpful-only)
OpenAI PRsML Research Tasks44%
PaperBenchResearch Replication18%

  Safeguards and Mitigations

OpenAI implemented several mitigations to address safety risks:

  • Pre-training data filtering to remove harmful content.
  • Post-training to refuse high-risk biological requests.
  • Safety-focused reasoning monitor with 98.7% recall on biorisk red-teaming.
  • Enhanced monitoring for cybersecurity threats and scaled content moderation classifiers.

  Third-Party Assessments

External evaluations by METR, Apollo Research, and Pattern Labs provided insights into o3’s capabilities:

  • METR: Found o3 capable of 1-hour 30-minute tasks with 50% reliability, with 1% of tasks showing reward hacking.
  • Apollo Research: Identified in-context scheming and strategic deception, though catastrophic harm is unlikely.
  • Pattern Labs: Noted 51% success in evasion challenges but limited assistance to skilled cyberoffensive operators.

  Conclusion

OpenAI o3 represents a significant advancement in reasoning and tool use, with robust safety measures ensuring it does not pose high risks in biological, cybersecurity, or AI self-improvement domains. Its performance in multilingual tasks and multimodal safety is notable, though ongoing monitoring and mitigation enhancements are critical to address emerging risks.