AI Agent Security — Attacks, Jailbreaking, and Defense · Jailbreaking — When and Why Safety Alignment Fails

Jailbreak Taxonomy: Roleplay/Persona, Prefix Injection, Refusal Suppression, Cipher Tricks

Introduction

Jailbreaks are not homogeneous: they fall into several technical classes, each exploiting a different mechanism. Roleplay and persona attacks abuse the helpfulness objective; prefix injection forces the response to open in a prescribed format; refusal suppression adds instructions that forbid the model from refusing; and cipher tricks encode the request so that its tokens fall outside the distribution covered by safety training. This lesson systematises the taxonomy and describes the mechanism and effectiveness of each class, with concrete examples.
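To keep the four classes straight through the rest of the lesson, the taxonomy can be written down as a small lookup table. The sketch below is only an illustration: the names `JailbreakClass`, `MECHANISM`, and `describe` are hypothetical and come from no library, and the mechanism strings simply restate the paragraph above.

```python
from enum import Enum


class JailbreakClass(Enum):
    """The four jailbreak classes named in this lesson (hypothetical enum)."""
    ROLEPLAY_PERSONA = "roleplay/persona"
    PREFIX_INJECTION = "prefix injection"
    REFUSAL_SUPPRESSION = "refusal suppression"
    CIPHER_TRICK = "cipher trick"


# Mechanism each class exploits, restated from the introduction.
MECHANISM = {
    JailbreakClass.ROLEPLAY_PERSONA: "abuses the helpfulness objective",
    JailbreakClass.PREFIX_INJECTION: "forces a response format",
    JailbreakClass.REFUSAL_SUPPRESSION: "injects instructions that block refusal",
    JailbreakClass.CIPHER_TRICK: "creates a token distribution outside safety coverage",
}


def describe(cls: JailbreakClass) -> str:
    """Return a one-line summary: class name and the mechanism it exploits."""
    return f"{cls.value}: {MECHANISM[cls]}"


if __name__ == "__main__":
    for cls in JailbreakClass:
        print(describe(cls))
```

A table like this could serve as the label set when tagging red-team transcripts or incident reports, so that each observed attack is recorded against exactly one mechanism.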