
[State of Code Evals] After SWE-bench, Code Clash & SOTA Coding Benchmarks recap — John Yang
Latent Space: The AI Engineer Podcast
- Published: December 31, 2025
- Duration: 17:45
- Summary source: description
- Last updated: May 31, 2026
Discusses OpenAI, Anthropic, evals, and multimodal benchmarks.
Show notes
From creating SWE-bench in a Princeton basement to shipping CodeClash, SWE-bench Multimodal, and SWE-bench Multilingual, John Yang has spent the last year and a half watching his benchmark become the de facto standard for evaluating AI coding agents—trusted by Cognition (Devin), OpenAI, Anthropic, and every major lab racing to solve software engineering at scale. We caught up with John live at NeurIPS 2025 to dig into the state of code evals heading into 2026: why SWE-bench went from ignored (Oc
Themes
- OpenAI
- Anthropic
- evals
- multimodal