Below is a snapshot of today's leading AI coding copilots. All metrics are taken from first-party pages, blog posts, or repositories. Where a company has not released numbers, the table shows "— (no official data)".
Key take-aways
• Claude Code currently leads on published SWE-bench numbers (72%+) while operating as an agent directly in the developer's terminal, whereas Devin posted the highest autonomous score among research agents at 13.9%.
• Enterprise-scale pair programmers (Copilot, Cascade) now publish usage telemetry—lines of code written or productivity lift—rather than classical benchmarks, signalling a shift to real-world KPIs.
• Long-context and latency races are heating up: Supermaven (300k-token context window, 250 ms completions) and Vercel v0 (94% error-free UI generation) highlight investment in bespoke small models tuned for IDE speed.
• Many rising vibe-coding app builders (Bolt.new, Lovable) still lack quantitative benchmarks. Expect more transparent evaluations as these products mature or start selling into enterprise environments.
Reading these numbers
Benchmarks such as SWE-bench Verified, HumanEval, Terminal-bench, and proprietary "error-free UI" tests measure different aspects of coding agents (bug-fixing across repos, Python function correctness, multi-step terminal tasks, and HTML/CSS/React fidelity, respectively). Match the benchmark to your workload before declaring a winner.
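To make the distinction concrete, a HumanEval-style test measures whether a generated Python function passes hidden unit tests. Below is a minimal, illustrative sketch of such a pass/fail check — the function names, candidate solution, and test cases are invented for this example, not taken from any benchmark's real harness:

```python
def evaluate_candidate(source: str, entry_point: str, tests) -> bool:
    """Exec a model-generated solution in a fresh namespace and run its tests."""
    namespace: dict = {}
    try:
        exec(source, namespace)          # compile & define the generated code
        func = namespace[entry_point]    # look up the required function
        return all(func(*args) == expected for args, expected in tests)
    except Exception:
        return False                     # any crash or wrong answer = failure

# Hypothetical model completion for an "add two numbers" task
candidate = "def add(a, b):\n    return a + b\n"
tests = [((1, 2), 3), ((-1, 1), 0)]
print(evaluate_candidate(candidate, "add", tests))  # True
```

Benchmarks like SWE-bench instead grade an agent on resolving real repository issues, so a model that aces this kind of function-level check can still score far lower on multi-file, multi-step tasks.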
Feel free to ask for deeper dives on any single agent or help running head-to-head evaluations in your own codebase!
References
- OpenAI Codex CLI GitHub repository, 2024-06.
- OpenAI DevDay session on internal software-engineering benchmark, 2024-11.
- ChatGPT Code Interpreter & Codex launch blog, OpenAI, 2024-12.
- ChatGPT Codex research-preview FAQ, OpenAI docs, 2025-01.
- Anthropic Claude Code announcement post, 2025-02.
- Anthropic Claude Opus 4 technical report, 2025-03.
- Cognition AI Devin launch video & white-paper, 2024-03.
- Cognition AI SWE-bench submission details, GitHub repo, 2024-03.
- Replit Agent & Code Repair 7B release blog, 2025-02.
- Replit DebugBench leaderboard, 2025-02 snapshot.
- Cursor Agent product page & docs, 2025-01.
- GitHub Copilot documentation hub, 2025-02.
- Microsoft WorkLab study "Developer Productivity & AI", 2024-10.
- Supermaven latency benchmark blog post, 2024-11.
- Supermaven Babble model context-window technical note, 2024-12.
- Windsurf Cascade product announcement, 2025-02.
- Windsurf Cascade usage telemetry dashboard, public report, 2025-03.
- Vercel v0.dev launch keynote & docs, 2024-12.
- Vercel v0-1.5 model evaluation white-paper, 2025-01.
- StackBlitz Bolt.new beta documentation, 2025-02.
- Lovable.dev platform overview & roadmap, 2025-02.