AI Agent Security — Attacks, Jailbreaking, and Defense · Jailbreaking — When and Why Safety Alignment Fails
Many-Shot Jailbreaking and Prompt Dilution — Scalable Attacks Without Gradients
Jailbreaking — When and Why Safety Alignment Fails
Introduction
Many-shot jailbreaking is a class of attacks exploiting in-context learning in long-context models — instead of modifying model architecture or using gradients, the attacker fills the context with hundreds of examples of "compliant responses", forcing the model to extrapolate the pattern. Prompt dilution weakens the safety signal by drowning harmful content in a sea of innocuous content. This lesson analyses the mechanisms, scaling, and deployment implications.