Robots Atlas>ROBOTS ATLAS

AI Agent Security — Attacks, Jailbreaking, and Defense · Prompt Injection — From Atomic Exploit to Multi-Stage Attack

Direct prompt injection: anatomy of the attack — "ignore previous instructions" and variants

Prompt Injection — From Atomic Exploit to Multi-Stage Attack

Introduction

Direct prompt injection (DPI) is an attack in which the adversary controls the direct input to a language model and uses a crafted instruction to force it to ignore the system prompt, assume a new role, or leak confidential data. This lesson breaks down the classic DPI patterns — from the historical "ignore previous instructions" (Riley Goodside, 2022) through role injection (DAN, STAN, jailbreak variants), separator attacks, prompt leaking, wrapping attacks to adversarial gradients (Zou et al. 2023). You will learn why alignment and system prompt instructions are not the same as security, and which mitigations matter.