CSE Colloquium – Evaluating AI Agents in the Real World: Lessons from Two Benchmarks
Presenter: Tanya Roosta, AMD
Abstract: Autonomous research and web-navigation agents (OpenAI Deep Research, Gemini Deep Research Max, OpenAI Operator) are now shipping to millions of users. Yet independent evaluation finds that leading agents reach less than 68% rubric compliance, and recent work shows that 14+ points of MMLU “performance” disappear once contamination is removed […]