
Reality: The Final Eval — Lukas Petersson and Axel Backlund of Andon Labs
Latent Space: The AI Engineer Podcast
- Published
- June 4, 2026
- Duration
- 1h 15m
- Summary source
- description
- Last updated
- Jun 10, 2026
Discusses evals.
Summary
The new AIEWF website is live! Get your tickets booked ASAP as they -will- sell out. Take the AI Engineering Survey and get >$2k in credits and free AIE WF tickets!Most industry benchmarks compress intelligence and reasoning ability into scores.SWE-Bench Pro, MMLU, Humanity’s Last Exam, etc. These metrics are useful, but don’t always represent the full ex…
Intelligent report
Sign in to read teasers, or upgrade to Research Pro to commission a new dossier for this episode. Learn more →
Show notes
The new AIEWF website is live! Get your tickets booked ASAP as they -will- sell out. Take the AI Engineering Survey and get >$2k in credits and free AIE WF tickets!Most industry benchmarks compress intelligence and reasoning ability into scores.SWE-Bench Pro, MMLU, Humanity’s Last Exam, etc. These metrics are useful, but don’t always represent the full extent of how a model performs in the real world. Some of the most interesting evals today look less like exams and more like operating businesse
Themes
- evals