CSE Colloquium – Safety Alignment of LMs via Non-cooperative Games

May 20 @ 11:00 am – 12:15 pm
Free

Presenter: Arman Zharmagambetov, Meta

Abstract:
Ensuring the safety of language models (LMs) while maintaining their usefulness remains a critical challenge in AI alignment. Current approaches rely on sequential adversarial training: generating adversarial (harmful) prompts and fine-tuning LMs to defend against them. We introduce a different paradigm: framing safety alignment as a non-zero-sum game between an Attacker LM and a Defender LM trained jointly via online reinforcement learning. Each LM continuously adapts to the other’s evolving strategies, driving iterative improvement. Our method uses a preference-based reward signal derived from pairwise comparisons instead of point-wise scores, providing more robust supervision and potentially reducing reward hacking. Our RL recipe, AdvGame, shifts the Pareto frontier of safety and utility, yielding a Defender LM that is simultaneously more helpful and more resilient to adversarial attacks. In addition, the resulting Attacker LM converges into a strong, general-purpose red-teaming agent that can be directly deployed to probe arbitrary target models.

Bio:
Arman Zharmagambetov is a research scientist on the Fundamental AI Research (FAIR) team at Meta. His research focuses primarily on machine learning and optimization, and more recently on applying them to improve the security and robustness of AI systems. He received his PhD from the University of California, Merced, and then completed postdoctoral research at FAIR, focusing on reinforcement learning, AI-guided design, and optimization.

Hosted by: Professor Alvaro Cardenas and Professor Sungjin Im

Date and Time: Wednesday, May 20, 2026, 11:00 am – 12:15 pm

Location: Engineering 2, Room E2-180 (Refreshments such as fruit, pastries, coffee, and tea will be provided.)

Zoom Option: https://ucsc.zoom.us/j/93445911992?pwd=YkJ2TQtF79h0PcNXbEcpZLbpK0coiY.1&jst=3
