CSE Colloquium – Evaluating AI Agents in the Real World: Lessons from Two Benchmarks

Start: 2026-05-06T11:00:00-07:00
End: 2026-05-06T12:15:00-07:00
Location: Engineering 2

May 6 @ 11:00 am – 12:15 pm

Free

Presenter: Tanya Roosta, AMD

Abstract:

Autonomous research and web-navigation agents (OpenAI Deep Research, Gemini Deep Research Max, OpenAI Operator) are now shipping to millions of users. Yet independent evaluation finds that leading agents reach less than 68% rubric compliance, and recent work shows that 14+ points of MMLU “performance” disappear once contamination is removed and 18+ points of web-agent performance disappear under stricter evaluation methodology. The leaderboards we trust are measuring less than we think.

This talk argues the path forward is on the evaluation side, and presents two complementary benchmarks built on that premise. iAgentBench targets the content axis — dynamic, cross-source sensemaking with auditable per-instance artifacts. RealWebAssist (AAAI 2026) targets the interaction axis — 1,885 sequential speech instructions from 10 real users on real websites, where the best agent reaches just 14% task success. Together they expose a shared diagnosis under two surfaces, and point to concrete PhD-scale open problems in retrieval, agent memory, and contamination-resistant evaluation.

Bio:

Tanya Roosta is a Director of AI in the AI Group (AIG) at AMD. Prior to joining AMD, she was a Senior Applied Science Manager at Amazon, where she led query understanding for Amazon Shopping Search. Her work focused on applying large language models to information retrieval and personalized product search. Before Search, she was part of the Alexa organization, working on natural language understanding and conversational AI. Earlier in her career, she also spent time in quantitative finance and investment.

Tanya has published research at major machine learning, natural language processing, and information retrieval conferences, and is a co-inventor on multiple patents in areas spanning AI systems and networking.

Alongside her industry experience, she is a Lecturer at the UC Berkeley School of Information, where she teaches Fundamentals of Machine Learning and Introduction to Statistical Theory. She holds a Ph.D. in Electrical Engineering and Computer Science from UC Berkeley, as well as a Master’s degree in Statistics and a Master’s degree in Financial Engineering, also from UC Berkeley.

Beyond her role at AMD, Tanya enjoys collaborating with Ph.D. students and academic researchers on topics related to large language models, generative AI, and federated learning, with a particular interest in bridging foundational research and real-world system impact.

Hosted by: Professor Alvaro Cardenas

Date and Time: Wednesdays from 11:00 am – 12:15 pm

Location: Engineering 2, Room E2-180 (Refreshments such as fruit, pastries, coffee, and tea will be provided.)

Zoom Option: https://ucsc.zoom.us/j/93445911992?pwd=YkJ2TQtF79h0PcNXbEcpZLbpK0coiY.1&jst=3

Details

Date: May 6
Time: 11:00 am – 12:15 pm
Cost: Free
Event Categories: Lectures & Presentations, Seminars
