AI Agent Security — Attacks, Jailbreaking, and Defense · Jailbreaking — When and Why Safety Alignment Fails
Two Failure Modes of Safety Training: Competing Objectives and Mismatched Generalization
Jailbreaking — When and Why Safety Alignment Fails
Introduction
Safety alignment of language models relies on RLHF and SFT to teach the model to refuse harmful requests. However, this training has two fundamental weak points: competing objectives (the goal of being helpful competes with the goal of being safe) and mismatched generalization (the model learns safety on the training distribution, but jailbreaks move outside it). This lesson dissects both mechanisms, presents concrete experiments, and discusses implications for AI system design.