AI Agent Security — Attacks, Jailbreaking, and Defense · How an AI Agent Attack Works — Mental Model and Threat Map
LLM as a trust system, not logic — trust boundary in the agent context
How an AI Agent Attack Works — Mental Model and Threat Map
Introduction
An LLM does not verify the truth of its inputs — it processes them according to the distribution learned during pretraining. This lesson builds the fundamental mental model: LLM as a statistical trust system, not a deterministic logic engine. It covers the concept of trust boundary in the AI agent context: who is a trusted sender, what "trust level" of a token in a sequence means, and why classical security mechanisms (whitelist, sandbox, policy) do not transfer 1:1 to an agent with an LLM.