Neural Networks: From Fundamentals to Modern AI · Convolutional Neural Networks (CNN)

2D convolution: filter as feature detector, padding, stride, equivariance

Introduction

2D convolution is the fundamental operation of convolutional networks. A small filter (kernel), typically 3×3 or 5×5, is slid over the image and at each position produces a scalar output: the dot product of the input window and the filter weights, plus a bias. The same learned weights are used at every position (weight sharing), which gives the network two key properties. The first is a large reduction in parameter count compared to a fully connected layer. The second is **translation equivariance**: shifting an object in the image shifts the corresponding activation in the feature map by the same vector. This holds exactly at stride 1 and away from the image borders; with stride s, an input shift by a multiple of s shifts the output by that multiple divided by s. Equivariance is NOT invariance: invariance says "the output does not change", equivariance says "the output changes in a predictable, identical way". Local pooling adds only approximate invariance to small shifts; invariance to arbitrary translations comes from global pooling or has to be learned end-to-end.

This lesson covers the math of convolution (including the convolution vs. cross-correlation convention: DL libraries compute cross-correlation, that is, they do not flip the kernel), the parameter count of a filter (k·k·C_in weights plus one bias), the output size formula (out = floor((in + 2p − k)/s) + 1), padding variants (valid: p = 0; same: p = (k − 1)/2 for odd k at stride 1; full: p = k − 1), the role of stride as downsampling, and why convolution learns a hierarchy of features (edges → textures → object parts → objects), as documented by, among others, Zeiler & Fergus (2014).
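The ideas in this introduction can be made concrete with a short NumPy sketch. It is illustrative, not an efficient or library-accurate implementation: the function names (`conv_output_size`, `cross_correlate2d`, `conv_param_count`) are assumptions made for this example, the input is a single channel, and padding is assumed to be symmetric zero padding. The sketch computes the sliding dot product without flipping the kernel (the cross-correlation convention), applies the output size formula, and checks translation equivariance on an impulse placed away from the borders.

```python
import numpy as np

def conv_output_size(n_in, k, p=0, s=1):
    """Output length along one axis: floor((n_in + 2p - k) / s) + 1."""
    return (n_in + 2 * p - k) // s + 1

def cross_correlate2d(image, kernel, bias=0.0, padding=0, stride=1):
    """Single-channel 2D cross-correlation (no kernel flip), loop-based."""
    k_h, k_w = kernel.shape
    padded = np.pad(image, padding)  # symmetric zero padding on all sides
    out_h = conv_output_size(image.shape[0], k_h, padding, stride)
    out_w = conv_output_size(image.shape[1], k_w, padding, stride)
    out = np.empty((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            top, left = i * stride, j * stride
            window = padded[top:top + k_h, left:left + k_w]
            # Dot product of the window with the shared weights, plus bias.
            out[i, j] = np.sum(window * kernel) + bias
    return out

def conv_param_count(k, c_in, c_out):
    """Weights (k * k * c_in per filter) plus one bias per filter."""
    return k * k * c_in * c_out + c_out

# Output sizes for an 8-pixel axis and a 3-wide kernel.
print(conv_output_size(8, 3, p=0))        # valid: 6
print(conv_output_size(8, 3, p=1))        # same:  8
print(conv_output_size(8, 3, p=2))        # full:  10
print(conv_output_size(8, 3, p=1, s=2))   # stride 2 downsamples: 4

# Equivariance check: shift the input, then compare with the shifted output.
image = np.zeros((8, 8))
image[2, 3] = 1.0                          # impulse away from the borders
kernel = np.arange(9, dtype=float).reshape(3, 3)

same = cross_correlate2d(image, kernel, padding=1)
shifted = np.roll(image, shift=(1, 2), axis=(0, 1))
same_shifted = cross_correlate2d(shifted, kernel, padding=1)

print(np.allclose(same_shifted, np.roll(same, shift=(1, 2), axis=(0, 1))))  # True
```

The equivariance check passes here because the impulse and its response stay inside the image. If the response reached the border, zero padding would cut it off and the equality would no longer be exact. As a point of comparison for parameter count, `conv_param_count(3, 3, 64)` gives 1792 parameters for 64 filters of size 3×3 over 3 input channels, regardless of image size.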