BEGIN:VCALENDAR
VERSION:2.0
PRODID:-//Events - ECPv6.15.20//NONSGML v1.0//EN
CALSCALE:GREGORIAN
METHOD:PUBLISH
X-WR-CALNAME:Events
X-ORIGINAL-URL:https://events.ucsc.edu
X-WR-CALDESC:Events for Events
REFRESH-INTERVAL;VALUE=DURATION:PT1H
X-Robots-Tag:noindex
X-PUBLISHED-TTL:PT1H
BEGIN:VTIMEZONE
TZID:America/Los_Angeles
BEGIN:DAYLIGHT
TZOFFSETFROM:-0800
TZOFFSETTO:-0700
TZNAME:PDT
DTSTART:20250309T020000
END:DAYLIGHT
BEGIN:STANDARD
TZOFFSETFROM:-0700
TZOFFSETTO:-0800
TZNAME:PST
DTSTART:20251102T020000
END:STANDARD
BEGIN:DAYLIGHT
TZOFFSETFROM:-0800
TZOFFSETTO:-0700
TZNAME:PDT
DTSTART:20260308T020000
END:DAYLIGHT
BEGIN:STANDARD
TZOFFSETFROM:-0700
TZOFFSETTO:-0800
TZNAME:PST
DTSTART:20261101T020000
END:STANDARD
BEGIN:DAYLIGHT
TZOFFSETFROM:-0800
TZOFFSETTO:-0700
TZNAME:PDT
DTSTART:20270314T020000
END:DAYLIGHT
BEGIN:STANDARD
TZOFFSETFROM:-0700
TZOFFSETTO:-0800
TZNAME:PST
DTSTART:20271107T020000
END:STANDARD
END:VTIMEZONE
BEGIN:VEVENT
DTSTART;TZID=America/Los_Angeles:20260506T110000
DTEND;TZID=America/Los_Angeles:20260506T121500
DTSTAMP:20260427T201620
CREATED:20260427T203808Z
LAST-MODIFIED:20260427T203808Z
UID:10013995-1778065200-1778069700@events.ucsc.edu
SUMMARY:CSE Colloquium - Evaluating AI Agents in the Real World: Lessons from Two Benchmarks
DESCRIPTION:Presenter: Tanya Roosta\, AMD \nAbstract: \nAutonomous research and web-navigation agents (OpenAI Deep Research\, Gemini Deep Research Max\, OpenAI Operator) are now shipping to millions of users. Yet independent evaluation finds leading agents reach less than 68% rubric compliance\, and recent work shows that 14+ points of MMLU “performance” disappear once contamination is removed and 18+ points of web-agent performance disappear under stricter evaluation methodology. The leaderboards we trust are measuring less than we think. \nThis talk argues the path forward is on the evaluation side\, and presents two complementary benchmarks built on that premise. iAgentBench targets the content axis: dynamic\, cross-source sensemaking with auditable per-instance artifacts. RealWebAssist (AAAI 2026) targets the interaction axis: 1\,885 sequential speech instructions from 10 real users on real websites\, where the best agent reaches just 14% task success. Together they expose a shared diagnosis under two surfaces\, and point to concrete PhD-scale open problems in retrieval\, agent memory\, and contamination-resistant evaluation. \nBio: \nTanya Roosta is a Director of AI in the AI Group (AIG) at AMD. Prior to joining AMD\, she was a Senior Applied Science Manager at Amazon\, where she led query understanding for Amazon Shopping Search. Her work focused on applying large language models to information retrieval and personalized product search. Before Search\, she was part of the Alexa organization\, working on natural language understanding and conversational AI. Earlier in her career\, she also spent time in quantitative finance and investment. \nTanya has published research at major machine learning\, natural language processing\, and information retrieval conferences\, and is a co-inventor on multiple patents in areas spanning AI systems and networking. \nAlongside her industry experience\, she is a Lecturer at the UC Berkeley School of Information\, where she teaches Fundamentals of Machine Learning and Introduction to Statistical Theory. She holds a Ph.D. in Electrical Engineering and Computer Science from UC Berkeley\, as well as a Master’s degree in Statistics and a Master’s degree in Financial Engineering\, also from UC Berkeley. \nBeyond her role at AMD\, Tanya enjoys collaborating with Ph.D. students and academic researchers on topics related to large language models\, generative AI\, and federated learning\, with a particular interest in bridging foundational research and real-world system impact. \nHosted by: Professor Alvaro Cardenas \nDate and Time: Wednesdays from 11:00 am – 12:15 pm \nLocation: Engineering 2\, Room E2-180 (Refreshments such as fruit\, pastries\, coffee\, and tea will be provided.) \nZoom Option: https://ucsc.zoom.us/j/93445911992?pwd=YkJ2TQtF79h0PcNXbEcpZLbpK0coiY.1&jst=3
URL:https://events.ucsc.edu/event/cse-colloquium-evaluating-ai-agents-in-the-real-world-lessons-from-two-benchmarks/
LOCATION:Engineering 2\, 1156 High Street\, Santa Cruz\, CA\, 95064
CATEGORIES:Lectures & Presentations,Seminars
ATTACH;FMTTYPE=image/png:https://events.ucsc.edu/wp-content/uploads/2026/03/BElogoWHITE.png
GEO:37.0009723;-122.0632371
X-APPLE-STRUCTURED-LOCATION;VALUE=URI;X-ADDRESS=Engineering 2 Engineering 2 1156 High Street Santa Cruz CA 95064;X-APPLE-RADIUS=500;X-TITLE=Engineering 2 1156 High Street:geo:37.0009723,-122.0632371
END:VEVENT
END:VCALENDAR