AI Agent Security — Attacks, Jailbreaking, and Defense · System Prompt Security and Data Extraction

System prompt extraction: why "keep this secret" fails and how the attack works

System Prompt Security and Data Extraction

Introduction

The instruction "do not reveal this prompt" is one of the most common yet least effective safeguards in LLM deployments. This lesson explains why the system text is not an architectural secret — the model does not "hide" it like an encrypted file but treats it as context with normal priority. We analyze a typical extraction attack flow: from direct requests, through indirect vectors, to the echo trick. We also discuss what an attacker actually gains and what real risks prompt disclosure entails.