AI Agent Security — Attacks, Jailbreaking, and Defense · System Prompt Security and Data Extraction
System prompt extraction: why "keep this secret" fails and how the attack works
System Prompt Security and Data Extraction
Introduction
The instruction "do not reveal this prompt" is one of the most common yet least effective safeguards in LLM deployments. This lesson explains why the system text is not an architectural secret — the model does not "hide" it like an encrypted file but treats it as context with normal priority. We analyze a typical extraction attack flow: from direct requests, through indirect vectors, to the echo trick. We also discuss what an attacker actually gains and what real risks prompt disclosure entails.