Results for "evals"

3 results

Episodes

  • Latent Space: The AI Engineer Podcast

    ⚡️The End of SWE-Bench Verified — Mia Glaese & Olivia Watkins, OpenAI Frontier Evals & Human Data

    Latent Space: The AI Engineer Podcast · Feb 23, 2026

    Olivia Watkins (Frontier Evals team) and Mia Glaese (VP of Research at OpenAI, leading the Codex, human data, and alignment teams) discuss a new blog post (https://openai.com/index/why-we-no-longer-evaluate-swe-bench-ver

    openai evals
  • The TWIML AI Podcast (formerly This Week in Machine Learning & Artificial Intelligence)

    How to Find the Agent Failures Your Evals Miss with Scott Clark

    The TWIML AI Podcast (formerly This Week in Machine Learning & Artificial Intelligence) · Scott Clark · May 7, 2026

    In this episode, Scott Clark, co-founder and CEO of Distributional, joins us to explore how teams can reliably operate and improve complex LLM systems and agents in production. Scott introduces a Maslow’s hierarchy of ob

    llm evals
  • Latent Space: The AI Engineer Podcast

    METR’s Joel Becker on exponential Time Horizon Evals, Threat Models, and the Limits of AI Productivity

    Latent Space: The AI Engineer Podcast · Feb 27, 2026

    This is a free preview of a paid episode. To hear more, visit www.latent.space. AIE Europe CFP and AIE World’s Fair paper submissions for CAIS peer review are due TODAY - do not delay! Last call ever. We’re excited to welco

    evals