AI Agent Security — Attacks, Jailbreaking, and Defense · Jailbreaking — When and Why Safety Alignment Fails
Jailbreak Taxonomy: Roleplay/Persona, Prefix Injection, Refusal Suppression, Cipher Tricks
Introduction
Jailbreaks are not homogeneous: they fall into several technical classes, each exploiting a different mechanism. Roleplay and persona attacks abuse the helpfulness objective. Prefix injection forces the response to begin in a set format. Refusal suppression adds instructions that forbid the model from refusing. Cipher tricks encode the request so that it falls outside the token distribution covered by safety training. This lesson systematises the taxonomy and describes the mechanism and effectiveness of each class, with concrete examples.
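The four classes can be captured in a small lookup structure, for instance to label red-team findings consistently. The following is a minimal sketch, not an established schema: the names `JailbreakClass`, `TaxonomyEntry`, `TAXONOMY`, and `mechanism_of` are hypothetical, and the mechanism strings only paraphrase the summary above.

```python
from dataclasses import dataclass
from enum import Enum


class JailbreakClass(Enum):
    """The four classes named in this lesson (identifiers are illustrative)."""
    ROLEPLAY_PERSONA = "roleplay/persona"
    PREFIX_INJECTION = "prefix injection"
    REFUSAL_SUPPRESSION = "refusal suppression"
    CIPHER_TRICKS = "cipher tricks"


@dataclass(frozen=True)
class TaxonomyEntry:
    jailbreak_class: JailbreakClass
    mechanism: str  # one-line summary of what the class exploits


# Mechanism text paraphrases the introduction; it is not an official definition.
TAXONOMY = {
    JailbreakClass.ROLEPLAY_PERSONA: TaxonomyEntry(
        JailbreakClass.ROLEPLAY_PERSONA, "abuses the helpfulness objective"
    ),
    JailbreakClass.PREFIX_INJECTION: TaxonomyEntry(
        JailbreakClass.PREFIX_INJECTION, "forces a response format"
    ),
    JailbreakClass.REFUSAL_SUPPRESSION: TaxonomyEntry(
        JailbreakClass.REFUSAL_SUPPRESSION, "injects instructions that block refusal"
    ),
    JailbreakClass.CIPHER_TRICKS: TaxonomyEntry(
        JailbreakClass.CIPHER_TRICKS, "moves input outside safety-training coverage"
    ),
}


def mechanism_of(jailbreak_class: JailbreakClass) -> str:
    """Return the one-line mechanism summary for a class."""
    return TAXONOMY[jailbreak_class].mechanism


if __name__ == "__main__":
    for entry in TAXONOMY.values():
        print(f"{entry.jailbreak_class.value}: {entry.mechanism}")
```

Keying the table by an enum rather than free-form strings means a mistyped class name raises an error at lookup time instead of silently creating a fifth category.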