Neural Networks: From Fundamentals to Modern AI · Interpretation and Visualization of Neural Networks

GradCAM: gradient-weighted class activation maps

Introduction

GradCAM (Selvaraju et al. 2017) is probably the most widely used method in practice for visualizing "where the network looks" when it makes a classification decision. It generalizes the earlier Class Activation Map (CAM, Zhou et al. 2016) to any CNN architecture, because it does not require a global-average-pooling (GAP) + Linear head. The algorithm is elegant: for a chosen class c, we compute the gradient of the logit y^c with respect to the activation maps A^k of a specific (usually the last) convolutional layer, average that gradient over the spatial dimensions to obtain one weight per channel, α_k^c = (1/Z) Σ_i Σ_j ∂y^c/∂A^k_ij (where Z is the number of spatial positions), and then apply a ReLU to the weighted sum of the activation maps: L_GradCAM^c = ReLU(Σ_k α_k^c · A^k). The result is a map at the spatial resolution of that convolutional layer (typically 7×7 or 14×14 for 224×224 ImageNet inputs), which we upsample to the input resolution.

The lesson covers: (1) the intuition for why the gradient is a meaningful weight of channel "importance" for a class, (2) the role of ReLU in the final formula, (3) compatibility with any CNN (VGG, ResNet, DenseNet, MobileNet), (4) applications to other tasks (image captioning, VQA), (5) differences between GradCAM, GradCAM++ (Chattopadhay et al. 2018) and HiResCAM (Draelos & Carin 2020), (6) typical interpretive pitfalls and sanity checks (Adebayo et al. 2018), (7) typical implementation mistakes.
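The core computation described in this introduction can be sketched in a few lines. The snippet below is a minimal, dependency-free illustration, not a reference implementation: it assumes the activation maps A^k and the gradients ∂y^c/∂A^k have already been obtained (in practice an autograd framework would supply them) and are given as nested lists of shape K × H × W. The function names `grad_cam` and `upsample_nearest` and the toy numbers are hypothetical, and nearest-neighbour upsampling is used only for simplicity; smoother interpolation such as bilinear is a common alternative.

```python
def grad_cam(activations, gradients):
    """Return the GradCAM map (H x W) from K x H x W activations and gradients."""
    num_channels = len(activations)
    height = len(activations[0])
    width = len(activations[0][0])

    # alpha_k: gradient of the class logit averaged over spatial positions.
    alphas = [sum(sum(row) for row in grad) / (height * width) for grad in gradients]

    # ReLU of the alpha-weighted sum of activation maps, position by position.
    return [
        [
            max(0.0, sum(alphas[k] * activations[k][i][j] for k in range(num_channels)))
            for j in range(width)
        ]
        for i in range(height)
    ]


def upsample_nearest(cam, out_height, out_width):
    """Nearest-neighbour upsampling of a coarse map (a simplifying assumption)."""
    height, width = len(cam), len(cam[0])
    return [
        [cam[i * height // out_height][j * width // out_width] for j in range(out_width)]
        for i in range(out_height)
    ]


# Toy example: two channels on a 2 x 2 grid (values are made up).
activations = [
    [[1.0, 2.0], [0.0, 1.0]],   # A^0
    [[0.0, 1.0], [3.0, 1.0]],   # A^1
]
gradients = [
    [[0.5, 0.5], [0.5, 0.5]],     # dy^c/dA^0, so alpha_0 = 0.5
    [[-1.0, 0.0], [-1.0, 0.0]],   # dy^c/dA^1, so alpha_1 = -0.5
]

cam = grad_cam(activations, gradients)      # [[0.5, 0.5], [0.0, 0.0]]
heatmap = upsample_nearest(cam, 4, 4)       # 4 x 4 map for overlaying on the input
```

In the toy example the second channel has a negative weight, so the bottom row, where that channel is strongly active, has a negative or zero weighted sum and is set to zero by the ReLU. Only positions with positive evidence for class c remain in the map.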