Neural Networks: From Fundamentals to Modern AI · Interpretation and Visualization of Neural Networks

Adversarial examples — when the network fails and why

Introduction

In 2013 Szegedy et al. reported a surprising phenomenon: for almost any correctly classified image, one can add a small perturbation, imperceptible to a human, that flips the network's decision to a wrong class, often with high confidence. These "adversarial examples" raised one of the central open questions in deep learning: why are models that exceed 95% top-5 accuracy on ImageNet so fragile? Goodfellow et al. (2015) proposed a linear explanation (networks behave "too linearly" in a high-dimensional input space) along with the simplest attack, the Fast Gradient Sign Method (FGSM):

x_adv = x + ε · sign(∇_x L(θ, x, y))

Here x is the input, y its true label, θ the model parameters, L the loss, and ε the perturbation budget: every input coordinate moves by exactly ε in the direction that increases the loss.

The lesson covers:
(1) the geometric intuition: high input dimensionality combined with limited model capacity leaves vast regions "near" an image that receive a different classification;
(2) the hierarchy of attacks: white-box vs black-box, untargeted vs targeted, single-step (FGSM) vs iterative (BIM, PGD);
(3) perturbation budgets measured in specific norms (L∞, L2, L0);
(4) attack transferability across models (Papernot et al. 2016);
(5) physical attacks (Athalye et al. 2018; road-sign stickers, Eykholt et al. 2018);
(6) defenses: adversarial training (Madry et al. 2018), defensive distillation (since discredited), certified defenses (randomized smoothing, Cohen et al. 2019);
(7) why "the network fails" is not a software bug but a fundamental phenomenon, revealing that our models learn something other than "image understanding".
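
The FGSM update and the linear explanation mentioned in the introduction can be illustrated with a minimal sketch. It is not taken from any of the cited papers. It assumes a toy logistic-regression "network" whose input gradient has the closed form (p - y) · w, and every name in it (predict, input_gradient, fgsm, demo), the weights of ±0.1 and the flat gray input are hypothetical choices made only for illustration. The per-pixel budget ε is the same in both runs, and only the input dimension changes.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict(w, b, x):
    """P(y = 1 | x) for a toy logistic-regression 'network'."""
    logit = sum(wi * xi for wi, xi in zip(w, x)) + b
    return sigmoid(logit)


def input_gradient(w, b, x, y):
    """Gradient of the cross-entropy loss with respect to the input x.

    For logistic regression this has the closed form (p - y) * w.
    """
    p = predict(w, b, x)
    return [(p - y) * wi for wi in w]


def sign(v):
    return (v > 0) - (v < 0)


def fgsm(w, b, x, y, eps):
    """Single-step FGSM: x_adv = x + eps * sign(grad_x L)."""
    grad = input_gradient(w, b, x, y)
    return [xi + eps * sign(gi) for xi, gi in zip(x, grad)]


def demo(dim, eps=0.05, seed=0):
    """Hypothetical setup: a flat gray 'image' with dim pixels in [0, 1]."""
    rng = random.Random(seed)
    w = [rng.choice([-0.1, 0.1]) for _ in range(dim)]  # toy weights
    x = [0.5] * dim
    # Choose the bias so the clean logit is about +1 (class 1, p ~ 0.73).
    b = 1.0 - sum(wi * xi for wi, xi in zip(w, x))
    y = 1  # true label
    x_adv = fgsm(w, b, x, y, eps)
    linf = max(abs(a - c) for a, c in zip(x_adv, x))
    return predict(w, b, x), predict(w, b, x_adv), linf


if __name__ == "__main__":
    for dim in (10, 1000):
        p_clean, p_adv, linf = demo(dim)
        print(f"dim={dim:5d}  p_clean={p_clean:.3f}  p_adv={p_adv:.3f}  Linf={linf:.3f}")
```

In this toy model the attack shifts the logit by ε · ‖w‖₁ = 0.005 · dim. With 10 inputs the shift is 0.05 and the prediction stays on class 1. With 1000 inputs the shift is 5 and the prediction flips, although no single pixel moved by more than ε = 0.05. This is the linear argument in miniature: many tiny coordinated changes add up across dimensions.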