

Virtual Event

Zhou, K. (CSE) – Toward Safer Frontier AI: From Evaluation and Red-Teaming to Alignment and Oversight

May 29 @ 11:00 am - 12:30 pm

This dissertation investigates how to make modern AI systems safer as they grow more capable. It addresses two central sources of risk: malicious misuse, in which adversarial users coerce models into harmful behavior, and internal misalignment, in which models themselves pursue goals that diverge from human intent through deception, sandbagging, or other covert behaviors. The dissertation identifies novel safety risks in frontier multimodal large language models and AI agents, introduces a black-box red-teaming framework for AI agents, proposes new safety alignment algorithms, and builds the first probe-based misalignment monitoring system. Together, these contributions provide practical approaches for evaluating, red-teaming, aligning, and overseeing frontier language models and agents. The central conclusion is that responsible AI cannot rest on any single guardrail: capability-scaled evaluation, active red-teaming, training-time alignment, and scalable monitoring together form a coordinated stack for frontier AI safety.

Event Host: Kaiwen Zhou, Ph.D. Candidate, Computer Science & Engineering 

Advisor: Xin Wang

Zoom: https://ucsc.zoom.us/j/94196702062?pwd=b9LJMfL232ixG2THMab8XuJ32a4FVD.1

Passcode: 584794
