AI Agent Security — Attacks, Jailbreaking, and Defense · Jailbreaking — When and Why Safety Alignment Fails
Application-Side Defence: Self-Reminder, Instruction Hierarchy, Constitutional AI in Practice
Introduction
Model alignment alone is not sufficient protection. Model producers, application operators, and researchers have therefore developed additional defence layers on the application side: self-reminder (the model is reminded of its rules with every query), instruction hierarchy (instructions are prioritised by their source), Constitutional AI in deployment practice, and combinations of these techniques in production systems. This lesson analyses each technique in turn: its mechanism, its effectiveness, and its limitations.
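To make the first two ideas concrete before the detailed analysis, here is a minimal sketch in Python. It assumes a chat application that assembles the prompt text itself. The reminder wording, the source names (`system`, `developer`, `user`, `tool`), and the priority ranks are illustrative assumptions, not a specific vendor's API. In practice an instruction hierarchy is largely trained into the model, so the ordering shown here only illustrates the principle on the application side.

```python
from dataclasses import dataclass

# Hypothetical reminder wording; real deployments tune this text.
REMINDER_PREFIX = (
    "You should be a responsible assistant and must not generate "
    "harmful or misleading content."
)
REMINDER_SUFFIX = (
    "Remember: you should be a responsible assistant and must not "
    "generate harmful or misleading content."
)


def wrap_with_self_reminder(user_query: str) -> str:
    """Self-reminder: sandwich the user query between two reminders."""
    return f"{REMINDER_PREFIX}\n\n{user_query}\n\n{REMINDER_SUFFIX}"


# Hypothetical priority ranks: a lower number means higher authority.
PRIORITY = {"system": 0, "developer": 1, "user": 2, "tool": 3}


@dataclass(frozen=True)
class Instruction:
    source: str  # one of the keys in PRIORITY (assumed naming)
    text: str


def order_by_authority(instructions: list[Instruction]) -> list[Instruction]:
    """Instruction hierarchy: put higher-authority sources first.

    sorted() is stable, so instructions from the same source keep
    their original relative order.
    """
    return sorted(instructions, key=lambda ins: PRIORITY[ins.source])
```

A caller would pass `wrap_with_self_reminder(query)` to the model in place of the raw query, and could use `order_by_authority` to decide which instruction wins when two sources conflict.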