
Agentic AIIntermediate
AI Agent Security — Attacks, Jailbreaking, and Defense
The course covers three layers of agentic system security: (1) attack taxonomy — direct and indirect prompt injection, cheap jailbreaking techniques such as DAN and roleplay, model inversion, and data extraction; (2) defense mechanisms — input/output sanitization, instruction hierarchy enforcement, agent tool sandboxing, system prompt hardening, and AI firewall as an external filtering layer; (3) secure multi-agent system design — privilege isolation, least-privilege tool access, auditing, and real-time anomaly monitoring. Prerequisites: familiarity with LLM APIs (OpenAI/Anthropic), experience building at least one agentic system or production chatbot. The course does NOT cover: cryptography, network infrastructure security, compliance and regulations (GDPR/AI Act), or adversarial ML (attacks on model weights). Graduates can assess the attack surface of their own agentic system, apply established defense patterns, and make informed trade-offs between security and agent utility.
Chapters
MODULE 01How an AI Agent Attack Works — Mental Model and Threat Map
This chapter builds the fundamental mental model for attacking an AI agent: from LLM trust boundaries, through the anatomy of the attack surface (LLM, tools, memory, orchestrator), to the OWASP GenAI Top 10:2025 taxonomy and a practical threat modeling canvas.
How an AI Agent Attack Works — Mental Model and Threat Map
- 1.1LLM as a trust system, not logic — trust boundary in the agent context
- 1.2Agent architecture as attack surface: LLM + tools + memory + orchestrator
- 1.3Direct attacker vs indirect attacker — fundamental difference
- 1.4OWASP GenAI Top 10:2025 — threat map as a course guide
- 1.5Threat modeling canvas for an agent with tools — practical exercise
MODULE 02Prompt Injection — From Atomic Exploit to Multi-Stage Attack
This chapter covers the anatomy of prompt injection attacks: from direct and indirect vectors, through invisible injection techniques, multi-stage C2 scenarios, persistent infections via agent memory, to a hands-on scenario of attacking an agent with tool calling.
Prompt Injection — From Atomic Exploit to Multi-Stage Attack
- 2.1Direct prompt injection: anatomy of the attack — "ignore previous instructions" and variants
- 2.2Indirect prompt injection: when data is an instruction — RAG, documents, emails, web scrape
- 2.3Invisible injections: Unicode Tags, ASCII smuggling, homoglyphs, white-on-white
- 2.4Multi-stage and deferred attacks: context pollution and C2 via LLM
- 2.5SpAIware: persistent injection via agent memory (ChatGPT memories case)
- 2.6Scenario: conduct indirect injection on a tool-calling agent — identify three vectors
MODULE 03Jailbreaking — When and Why Safety Alignment Fails
This chapter analyses the failure modes of safety alignment: from competing objectives and mismatched generalization, through a taxonomy of jailbreak techniques, many-shot gradient-free attacks, the distinction between jailbreaking and prompt injection, to application-side defences — self-reminder, instruction hierarchy, and Constitutional AI.
Jailbreaking — When and Why Safety Alignment Fails
- 3.1Two Failure Modes of Safety Training: Competing Objectives and Mismatched Generalization
- 3.2Jailbreak Taxonomy: Roleplay/Persona, Prefix Injection, Refusal Suppression, Cipher Tricks
- 3.3Many-Shot Jailbreaking and Prompt Dilution — Scalable Attacks Without Gradients
- 3.4Jailbreak vs Prompt Injection — Where Model Responsibility Ends, Where Application Responsibility Begins
- 3.5Application-Side Defence: Self-Reminder, Instruction Hierarchy, Constitutional AI in Practice
MODULE 04System Prompt Security and Data Extraction
Chapter on system prompt attacks and LLM data leakage: prompt extraction, multi-step manipulation, PII and API key disclosure, training data extraction, and multilayer system prompt hardening.
System Prompt Security and Data Extraction
- 4.1System prompt extraction: why "keep this secret" fails and how the attack works
- 4.2Extraction techniques: multi-step manipulation, role-switching, context hijacking
- 4.3Sensitive information disclosure: PII, API keys, internal configs in LLM outputs
- 4.4Training data extraction and the limits of model inversion
- 4.5System prompt hardening: what works, what does not — a multilayer approach to protection
MODULE 05Agent Security with Tools and MCP
This chapter covers threats and protection mechanisms for tool-equipped AI agents: excessive agency (OWASP LLM06), MCP protocol security, privilege escalation in multi-agent systems, human-in-the-loop configuration, and practical audit trail design.
Agent Security with Tools and MCP
- 5.1OWASP LLM06:2025 Excessive Agency — three dimensions: function, permission, autonomy
- 5.2Least-privilege agent: designing minimal-capability tool sets
- 5.3MCP security: tool poisoning, confused deputy, and rug-pull in the Model Context Protocol
- 5.4Cross-agent privilege escalation: how a sub-agent hijacks the orchestrator
- 5.5Human-in-the-loop: HITL configuration for destructive operations (delete, send, execute)
- 5.6Audit trail and observability for agent actions: what to log and how
MODULE 06Guardrails and AI Firewall — Multi-Layer Defense
This chapter covers defense-in-depth architecture for AI systems: input/output filters, Llama Guard, NeMo Guardrails and Guardrails AI tools, the Dual LLM pattern, agent sandboxing, and dynamic adaptation of guardrails against evolving attacks.
Guardrails and AI Firewall — Multi-Layer Defense
- 6.1Defense-in-depth architecture: pre-LLM filter — model — post-LLM filter — monitoring
- 6.2Input validation and output sanitization: what works, what does not — why blocklists fail
- 6.3Llama Guard, NeMo Guardrails, and Guardrails AI — comparison and pitfalls
- 6.4Dual LLM pattern: use a second model as guardian of your own model
- 6.5Agent sandboxing: deterministic isolation vs AI-based allow-list
- 6.6Pitfall: "Attacker Moves Second" — why static guardrail configuration is not enough
MODULE 07Red Teaming, Monitoring, and Secure Design of Agentic Systems
This chapter covers the full offensive-defensive cycle: from planning and automating LLM red teaming, through building security evals in CI/CD and runtime monitoring, to a secure design checklist for agentic systems and a final security assessment scenario.
Red Teaming, Monitoring, and Secure Design of Agentic Systems
- 7.1LLM Red Teaming Methodology: Planning, Scope, Threat Model — Test Plan
- 7.2Red Teaming Automation: garak, PyRIT, PAIR — Tools Overview
- 7.3Security Evals in CI/CD Pipeline: Test Suite as Continuous Security Gate
- 7.4Runtime Monitoring and Attack Detection: Anomaly Detection, Behavioral Alerts
- 7.5Secure Design Checklist for Agentic Systems: From Threat Model to Deployment
- 7.6Final Scenario: Full Agent Security Assessment — Plan, Execution, Report