AI Agent Security — Attacks, Jailbreaking, and Defense · System Prompt Security and Data Extraction
Extraction techniques: multi-step manipulation, role-switching, context hijacking
System Prompt Security and Data Extraction
Introduction
System prompt extraction attacks go far beyond the simple question "print your prompt". Advanced attackers use multi-step techniques that gradually build context allowing the model to "forget" confidentiality instructions. This lesson covers four main technique classes: (1) multi-step manipulation — building conversation context that leads to disclosure; (2) role-switching — altering the model's persona through fictional frameworks; (3) context hijacking — taking over the conversational frame; (4) meta-level exploits — using knowledge of LLM nature against the model itself.