AI Agent Security — Attacks, Jailbreaking, and Defense · System Prompt Security and Data Extraction
Training data extraction and the limits of model inversion
Introduction
Language models do not just "know"; they also "remember". Carlini et al. (2021) showed that verbatim fragments of training data, including personal data, can be extracted from GPT-2 by querying the model. This lesson covers two related areas: training data extraction, in which an attacker recovers specific data from a model through carefully crafted queries, and model inversion, in which an attacker reconstructs properties of the input data from model outputs. We also analyze the limits of these attacks, memorization metrics, and defenses based on differential privacy.
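To make the idea of verbatim extraction concrete, here is a minimal sketch of one way to test for it: prompt a model with the start of a known record and check whether the continuation reproduces the rest. It is an illustration, not the procedure from Carlini et al. The names `is_verbatim_extractable`, `extraction_rate`, and `toy_generate` are hypothetical, the "model" is a toy lookup table standing in for a real language model, and the split is done on characters rather than tokens for simplicity.

```python
from typing import Callable, Iterable

# Hypothetical interface: any function that maps a prompt to a continuation.
Generator = Callable[[str], str]


def is_verbatim_extractable(generate: Generator, record: str, prefix_len: int) -> bool:
    """Return True if prompting with the first prefix_len characters of
    record makes the model reproduce the remainder verbatim."""
    prefix, suffix = record[:prefix_len], record[prefix_len:]
    if not suffix:
        return False  # nothing left to extract
    return generate(prefix).startswith(suffix)


def extraction_rate(generate: Generator, records: Iterable[str], prefix_len: int) -> float:
    """Fraction of records whose suffix is reproduced verbatim."""
    records = list(records)
    if not records:
        return 0.0
    hits = sum(is_verbatim_extractable(generate, r, prefix_len) for r in records)
    return hits / len(records)


# Toy stand-in for a model (assumption): it "memorized" exactly one record.
MEMORIZED = ["Contact Jane Doe at 555-0100 for access."]


def toy_generate(prefix: str) -> str:
    for text in MEMORIZED:
        if text.startswith(prefix):
            return text[len(prefix):]
    return "no memorized continuation"


records = MEMORIZED + ["This sentence was never memorized by the toy model."]
rate = extraction_rate(toy_generate, records, prefix_len=12)
print(f"verbatim extraction rate: {rate:.2f}")
```

With a real model, `toy_generate` would be replaced by a call to the model's decoding routine, and the result would depend on the decoding strategy and prefix length, so a rate measured this way is best read as a lower bound on what the model has memorized.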