Robots Atlas>ROBOTS ATLAS

AI Agent Security — Attacks, Jailbreaking, and Defense · Jailbreaking — When and Why Safety Alignment Fails

Two Failure Modes of Safety Training: Competing Objectives and Mismatched Generalization

Jailbreaking — When and Why Safety Alignment Fails

Introduction

Safety alignment of language models relies on RLHF and SFT to teach the model to refuse harmful requests. However, this training has two fundamental weak points: competing objectives (the goal of being helpful competes with the goal of being safe) and mismatched generalization (the model learns safety on the training distribution, but jailbreaks move outside it). This lesson dissects both mechanisms, presents concrete experiments, and discusses implications for AI system design.