Robots Atlas>ROBOTS ATLAS

AI Agent Security — Attacks, Jailbreaking, and Defense · Jailbreaking — When and Why Safety Alignment Fails

Application-Side Defence: Self-Reminder, Instruction Hierarchy, Constitutional AI in Practice

Jailbreaking — When and Why Safety Alignment Fails

Introduction

Model alignment is not sufficient protection — model producers, application operators, and researchers have developed additional defence layers on the application side: self-reminder (the model reminds itself of rules with each query), instruction hierarchy (prioritisation of instruction sources), Constitutional AI in deployment practice, and combinations of these techniques in production systems. This lesson analyses each technique, its mechanism, effectiveness, and limitations.