[State of Code Evals] After SWE-bench, Code Clash & SOTA Coding Benchmarks recap — John Yang

Latent Space: The AI Engineer Podcast

Published: December 31, 2025
Duration: 17:45
Summary source: description
Last updated: May 31, 2026

Discusses OpenAI, Anthropic, evals, and multimodal benchmarks.

Summary

From creating SWE-bench in a Princeton basement to shipping CodeClash, SWE-bench Multimodal, and SWE-bench Multilingual, John Yang has spent the last year and a half watching his benchmark become the de facto standard for evaluating AI coding agents—trusted by Cognition (Devin), OpenAI, Anthropic, and every major lab racing to solve software engineering a…

Themes

openai
anthropic
evals
multimodal
