Presenter: Tanya Roosta, AMD
Abstract: Autonomous research and web-navigation agents (OpenAI Deep Research, Gemini Deep Research Max, OpenAI Operator) are now shipping to millions of users. Yet independent evaluation finds that leading agents reach less than 68% rubric compliance, and recent work shows that 14+ points of MMLU “performance” disappear once contamination is removed […]
Free