Robots Atlas

Agentic AI · Intermediate

AI Agent Security — Attacks, Jailbreaking, and Defense

7 Chapters · 39 Lessons

The course covers three layers of agentic system security: (1) attack taxonomy (direct and indirect prompt injection, low-cost jailbreaking techniques such as DAN and roleplay, model inversion, and data extraction); (2) defense mechanisms (input/output sanitization, instruction hierarchy enforcement, agent tool sandboxing, system prompt hardening, and an AI firewall as an external filtering layer); (3) secure multi-agent system design (privilege isolation, least-privilege tool access, auditing, and real-time anomaly monitoring).

Prerequisites: familiarity with LLM APIs (OpenAI/Anthropic) and experience building at least one agentic system or production chatbot.

The course does NOT cover cryptography, network infrastructure security, compliance and regulations (GDPR/AI Act), or adversarial ML (attacks on model weights).

Graduates can assess the attack surface of their own agentic system, apply established defense patterns, and make informed trade-offs between security and agent utility.

Chapters

MODULE 01

How an AI Agent Attack Works — Mental Model and Threat Map

This chapter builds the fundamental mental model for attacking an AI agent: from LLM trust boundaries, through the anatomy of the attack surface (LLM, tools, memory, orchestrator), to the OWASP GenAI Top 10:2025 taxonomy and a practical threat modeling canvas.

  1. 1.1 LLM as a trust system, not logic: trust boundary in the agent context
  2. 1.2 Agent architecture as attack surface: LLM + tools + memory + orchestrator
  3. 1.3 Direct attacker vs indirect attacker: the fundamental difference
  4. 1.4 OWASP GenAI Top 10:2025: threat map as a course guide
  5. 1.5 Threat modeling canvas for an agent with tools: practical exercise

MODULE 02

Prompt Injection — From Atomic Exploit to Multi-Stage Attack

This chapter covers the anatomy of prompt injection attacks: from direct and indirect vectors, through invisible injection techniques, multi-stage C2 scenarios, persistent infections via agent memory, to a hands-on scenario of attacking an agent with tool calling.

  1. 2.1 Direct prompt injection: anatomy of the attack, "ignore previous instructions" and variants
  2. 2.2 Indirect prompt injection: when data is an instruction (RAG, documents, emails, web scrape)
  3. 2.3 Invisible injections: Unicode Tags, ASCII smuggling, homoglyphs, white-on-white
  4. 2.4 Multi-stage and deferred attacks: context pollution and C2 via LLM
  5. 2.5 SpAIware: persistent injection via agent memory (ChatGPT memories case)
  6. 2.6 Scenario: conduct an indirect injection on a tool-calling agent and identify three vectors
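
As a rough illustration of the invisible-injection topic in lesson 2.3, the following Python sketch shows ASCII smuggling with Unicode Tag characters (the U+E0000 to U+E007F block). The helper names `hide`, `reveal_tags`, and `strip_tags` are hypothetical and not taken from the course material. The sketch assumes only that an attacker shifts printable ASCII into the tag range, which many renderers display as nothing.

```python
TAG_BASE = 0xE0000  # start of the Unicode Tags block


def hide(text: str) -> str:
    """Map printable ASCII onto invisible tag characters (attacker side)."""
    for c in text:
        if not 0x20 <= ord(c) <= 0x7E:
            raise ValueError("only printable ASCII can be mapped to tag characters")
    return "".join(chr(TAG_BASE + ord(c)) for c in text)


def reveal_tags(text: str) -> str:
    """Decode any tag characters back to ASCII so a reviewer can read them."""
    return "".join(
        chr(ord(c) - TAG_BASE) for c in text if 0xE0020 <= ord(c) <= 0xE007E
    )


def strip_tags(text: str) -> str:
    """Drop every code point in the Tags block before text reaches the model."""
    return "".join(c for c in text if not 0xE0000 <= ord(c) <= 0xE007F)


visible = "Summarize this page."
document = visible + hide("ignore previous instructions")

print(len(visible), len(document))  # the payload adds characters that do not render
print(reveal_tags(document))        # surfaces the hidden instruction
print(strip_tags(document) == visible)
```

Stripping the Tags block is a narrow filter. It would not catch homoglyphs or white-on-white text, which need separate handling.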

MODULE 03

Jailbreaking — When and Why Safety Alignment Fails

This chapter analyses the failure modes of safety alignment: from competing objectives and mismatched generalization, through a taxonomy of jailbreak techniques, many-shot gradient-free attacks, the distinction between jailbreaking and prompt injection, to application-side defences — self-reminder, instruction hierarchy, and Constitutional AI.

  1. 3.1 Two Failure Modes of Safety Training: Competing Objectives and Mismatched Generalization
  2. 3.2 Jailbreak Taxonomy: Roleplay/Persona, Prefix Injection, Refusal Suppression, Cipher Tricks
  3. 3.3 Many-Shot Jailbreaking and Prompt Dilution: Scalable Attacks Without Gradients
  4. 3.4 Jailbreak vs Prompt Injection: Where Model Responsibility Ends and Application Responsibility Begins
  5. 3.5 Application-Side Defence: Self-Reminder, Instruction Hierarchy, Constitutional AI in Practice

MODULE 04

System Prompt Security and Data Extraction

This chapter covers system prompt attacks and LLM data leakage: prompt extraction, multi-step manipulation, PII and API key disclosure, training data extraction, and multilayer system prompt hardening.

  1. 4.1 System prompt extraction: why "keep this secret" fails and how the attack works
  2. 4.2 Extraction techniques: multi-step manipulation, role-switching, context hijacking
  3. 4.3 Sensitive information disclosure: PII, API keys, internal configs in LLM outputs
  4. 4.4 Training data extraction and the limits of model inversion
  5. 4.5 System prompt hardening: what works and what does not, a multilayer approach to protection

MODULE 05

Agent Security with Tools and MCP

This chapter covers threats and protection mechanisms for tool-equipped AI agents: excessive agency (OWASP LLM06), MCP protocol security, privilege escalation in multi-agent systems, human-in-the-loop configuration, and practical audit trail design.

  1. 5.1 OWASP LLM06:2025 Excessive Agency: three dimensions (function, permission, autonomy)
  2. 5.2 Least-privilege agent: designing minimal-capability tool sets
  3. 5.3 MCP security: tool poisoning, confused deputy, and rug-pull in the Model Context Protocol
  4. 5.4 Cross-agent privilege escalation: how a sub-agent hijacks the orchestrator
  5. 5.5 Human-in-the-loop: HITL configuration for destructive operations (delete, send, execute)
  6. 5.6 Audit trail and observability for agent actions: what to log and how
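
To make the least-privilege and human-in-the-loop ideas in lessons 5.2 and 5.5 more concrete, here is a minimal Python sketch of a deterministic gate placed in front of tool execution. Everything in it is an assumption for illustration: the tool names, the `ToolCall` type, the `authorize` function, and its return strings are hypothetical and do not come from the course or from any particular agent framework.

```python
from dataclasses import dataclass, field
from typing import Callable

# Hypothetical names for tools that delete, send, or execute.
DESTRUCTIVE_TOOLS = frozenset({"delete_file", "send_email", "execute_shell"})


@dataclass(frozen=True)
class ToolCall:
    name: str
    args: dict = field(default_factory=dict)


def authorize(
    call: ToolCall,
    allowed_tools: frozenset,
    approve: Callable[[ToolCall], bool],
) -> str:
    """Return "denied", "rejected", or "allowed" for a requested tool call."""
    # Least privilege: anything outside this agent's tool set is refused outright.
    if call.name not in allowed_tools:
        return "denied"
    # Human-in-the-loop: destructive tools additionally need explicit approval.
    if call.name in DESTRUCTIVE_TOOLS and not approve(call):
        return "rejected"
    return "allowed"


# An agent that may only read files and send email.
allowed = frozenset({"read_file", "send_email"})


def human_says_no(call: ToolCall) -> bool:
    return False


print(authorize(ToolCall("read_file"), allowed, human_says_no))
print(authorize(ToolCall("execute_shell"), allowed, human_says_no))
print(authorize(ToolCall("send_email"), allowed, human_says_no))
```

The check runs in ordinary code outside the model, so an injected instruction cannot talk its way past it. In a real system the `approve` callback would block on a reviewer's decision, and each outcome would be written to the audit trail.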

MODULE 06

Guardrails and AI Firewall — Multi-Layer Defense

This chapter covers defense-in-depth architecture for AI systems: input/output filters, Llama Guard, NeMo Guardrails and Guardrails AI tools, the Dual LLM pattern, agent sandboxing, and dynamic adaptation of guardrails against evolving attacks.

  1. 6.1 Defense-in-depth architecture: pre-LLM filter, model, post-LLM filter, monitoring
  2. 6.2 Input validation and output sanitization: what works, what does not, and why blocklists fail
  3. 6.3 Llama Guard, NeMo Guardrails, and Guardrails AI: comparison and pitfalls
  4. 6.4 Dual LLM pattern: using a second model as a guardian of your own model
  5. 6.5 Agent sandboxing: deterministic isolation vs AI-based allow-list
  6. 6.6 Pitfall: "Attacker Moves Second" and why static guardrail configuration is not enough

MODULE 07

Red Teaming, Monitoring, and Secure Design of Agentic Systems

This chapter covers the full offensive-defensive cycle: from planning and automating LLM red teaming, through building security evals in CI/CD and runtime monitoring, to a secure design checklist for agentic systems and a final security assessment scenario.

  1. 7.1 LLM Red Teaming Methodology: Planning, Scope, Threat Model, Test Plan
  2. 7.2 Red Teaming Automation: garak, PyRIT, PAIR (Tools Overview)
  3. 7.3 Security Evals in CI/CD Pipeline: Test Suite as Continuous Security Gate
  4. 7.4 Runtime Monitoring and Attack Detection: Anomaly Detection, Behavioral Alerts
  5. 7.5 Secure Design Checklist for Agentic Systems: From Threat Model to Deployment
  6. 7.6 Final Scenario: Full Agent Security Assessment (Plan, Execution, Report)