Zhou, K. (CSE) – Toward Safer Frontier AI: From Evaluation and Red-Teaming to Alignment and Oversight
This dissertation investigates how to make modern AI systems safer as they grow more capable. It addresses two central sources of risk: malicious misuse, in which adversarial users coerce models into harmful behavior, and internal misalignment, in which models themselves pursue goals that diverge from human intent, whether through deception, sandbagging, or other covert behaviors. The […]