CSE Colloquium – Safety Alignment of LMs via Non-cooperative Games
Presenter: Arman Zharmagambetov, Meta
Abstract: Ensuring the safety of language models (LMs) while maintaining their usefulness remains a critical challenge in AI alignment. Current approaches rely on sequential adversarial training: generating adversarial (harmful) prompts and fine-tuning LMs to defend against them. We introduce a different paradigm: framing safety alignment as a non-zero-sum game between an […]