Results for "evals"

Keyword scan across titles, descriptions, summaries, and tags. For interview listings, try Guest appearances.

8 results

Episodes

⚡️The End of SWE-Bench Verified — Mia Glaese & Olivia Watkins, OpenAI Frontier Evals & Human Data
Latent Space: The AI Engineer Podcast · Feb 23, 2026
Olivia Watkins (Frontier Evals team) and Mia Glaese (VP of Research at OpenAI, leading the Codex, human data, and alignment teams) discuss a new blog post (https://openai.com/index/why-we-no-longer-evaluate-swe-bench-ver…
openai, evals
[State of Evals] LMArena's $1.7B Vision — Anastasios Angelopoulos, LMArena
Latent Space: The AI Engineer Podcast · Jan 6, 2026
We are reupping this episode after LMArena announced their fresh Series A (https://www.theinformation.com/articles/ai-evaluation-startup-lmarena-valued-1-7-billion-new-funding-round?rc=luxwz4), raising $150m at a $1.7B v…
evals
Giving Agents Computers — Ivan Burazin, Daytona
Latent Space: The AI Engineer Podcast · May 21, 2026
Take the 2026 AI Engineering Survey and get >$2k in credits and AIE WF tickets! On the product side, everyone is getting Computer - Perplexity, Manus, Cursor, and so on. Meanwhile on the research side, agentic evals like …
llm, agents, evals
[State of Code Evals] After SWE-bench, Code Clash & SOTA Coding Benchmarks recap — John Yang
Latent Space: The AI Engineer Podcast · Dec 31, 2025
From creating SWE-bench in a Princeton basement to shipping CodeClash, SWE-bench Multimodal, and SWE-bench Multilingual, John Yang has spent the last year and a half watching his benchmark become the de facto standard fo…
openai, anthropic, evals, multimodal
Reality: The Final Eval — Lukas Petersson and Axel Backlund of Andon Labs
Latent Space: The AI Engineer Podcast · Jun 4, 2026
The new AIEWF website is live! Get your tickets booked ASAP as they -will- sell out. Take the AI Engineering Survey and get >$2k in credits and free AIE WF tickets! Most industry benchmarks compress intelligence and reaso…
evals
How to Find the Agent Failures Your Evals Miss with Scott Clark
The TWIML AI Podcast (formerly This Week in Machine Learning & Artificial Intelligence) · Scott Clark · May 7, 2026
In this episode, Scott Clark, co-founder and CEO of Distributional, joins us to explore how teams can reliably operate and improve complex LLM systems and agents in production. Scott introduces a Maslow’s hierarchy of ob…
llm, evals
METR’s Joel Becker on exponential Time Horizon Evals, Threat Models, and the Limits of AI Productivity
Latent Space: The AI Engineer Podcast · Feb 27, 2026
This is a free preview of a paid episode. To hear more, visit www.latent.space AIE Europe CFP and AIE World’s Fair paper submissions for CAIS peer review are due TODAY - do not delay! Last call ever. We’re excited to welco…
evals
Artificial Analysis: Independent LLM Evals as a Service — with George Cameron and Micah Hill-Smith
Latent Space: The AI Engineer Podcast · Artificial Analysis · Jan 8, 2026
Happy New Year! You may have noticed that in 2025 we had moved toward YouTube as our primary podcasting platform. As we’ll explain in the next State of Latent Space post, we’ll be doubling down on Substack again and impr…
llm, evals