AI Agent Security — Attacks, Jailbreaking, and Defense · Jailbreaking — When and Why Safety Alignment Fails

Many-Shot Jailbreaking and Prompt Dilution — Scalable Attacks Without Gradients

Jailbreaking — When and Why Safety Alignment Fails

Introduction

Many-shot jailbreaking is a class of attacks exploiting in-context learning in long-context models — instead of modifying model architecture or using gradients, the attacker fills the context with hundreds of examples of "compliant responses", forcing the model to extrapolate the pattern. Prompt dilution weakens the safety signal by drowning harmful content in a sea of innocuous content. This lesson analyses the mechanisms, scaling, and deployment implications.