
Deep Learning · Intermediate
Neural Networks: From Fundamentals to Modern AI
The course covers the full scope of neural networks, from mathematical foundations (linear algebra, calculus, statistics), through the backpropagation mechanism, to modern deep learning architectures used in industry and research. Participants study fully connected networks (MLP), convolutional networks (CNN), recurrent networks (RNN, LSTM, GRU), attention mechanisms, and the fundamentals of transformers. All material is grounded in the PyTorch ecosystem: every implementation is coded from scratch and then refactored to idiomatic framework code. Prerequisites: Python scripting and basic NumPy; no prior ML library experience or advanced mathematics required (all necessary concepts are introduced in-course). Not covered: large language models (LLM), diffusion models, reinforcement learning, production deployment (MLOps), or advanced regularization beyond the practical level. Graduates are ready to independently design deep network experiments, interpret training results, and join PyTorch-based projects without senior support.
Chapters
MODULE 01 What is a neural network: your mental model of AI
A beginner-friendly introductory chapter: what AI, ML and deep learning are, how an artificial neural network works, the three learning paradigms, and the lifecycle of an ML project. No code, no formulas, only intuition and everyday-life analogies.
MODULE 02 Math and tools: tensors, gradients, Python, NumPy
Mathematical foundation before PyTorch: scalar, vector, matrix and tensor with geometric intuition, tensor operations, derivative and chain rule, gradient of a multi-variable function, gradient descent on a simple 1D function, and Python + NumPy as a bridge to PyTorch. No epsilon-delta proofs, only intuition, directions, and arrows on the loss map.
- 2.1 Scalar, vector, matrix, tensor: geometric intuition
- 2.2 Tensor operations: addition, multiplication, matrix multiplication
- 2.3 Derivative and chain rule: intuition of the direction of fastest growth
- 2.4 Gradient of a multi-variable function: an arrow on the loss map
- 2.5 Gradient descent on a simple function: walking downhill step by step
- 2.6 Python, NumPy and the first tensor: a bridge to PyTorch
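As a rough illustration of topic 2.5, gradient descent on a 1D function can be sketched in plain Python. The function f(x) = (x - 3)^2, the starting point, the learning rate and the step count are assumptions chosen for this example, not course material.

```python
def gradient_descent_1d(grad, x0, lr=0.1, steps=50):
    """Repeatedly step against the derivative; return the final x and the path."""
    x = x0
    path = [x]
    for _ in range(steps):
        x = x - lr * grad(x)  # move downhill: opposite to the slope
        path.append(x)
    return x, path


# Illustrative function: f(x) = (x - 3)^2, so f'(x) = 2 * (x - 3).
x_min, path = gradient_descent_1d(lambda x: 2.0 * (x - 3.0), x0=0.0)
print(f"x after {len(path) - 1} steps: {x_min:.4f}")
```

With this learning rate each step shrinks the distance to the minimum at x = 3 by a constant factor, which is the "walking downhill step by step" picture in its simplest form.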
MODULE 03 Your First End-to-End Training: From Data to Prediction
Your first working classifier: how data becomes a prediction. You work through the dataset, the loss, the training loop (forward, loss, gradient, update), and evaluation, then code an XOR classifier in pure NumPy.
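The four-step training loop named above (forward, loss, gradient, update) might look roughly like this for XOR in NumPy. The hidden size, tanh/sigmoid activations, learning rate, step count and seed are assumptions for this sketch; the course's own implementation may differ.

```python
import numpy as np


def train_xor(steps=2000, lr=0.5, hidden=8, seed=0):
    """Train a tiny 2-layer network on XOR with full-batch gradient descent."""
    rng = np.random.default_rng(seed)
    X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
    y = np.array([[0], [1], [1], [0]], dtype=float)

    W1 = rng.normal(0.0, 1.0, (2, hidden))
    b1 = np.zeros((1, hidden))
    W2 = rng.normal(0.0, 1.0, (hidden, 1))
    b2 = np.zeros((1, 1))

    losses = []
    for _ in range(steps):
        # 1. Forward: hidden tanh layer, sigmoid output probability.
        h = np.tanh(X @ W1 + b1)
        p = 1.0 / (1.0 + np.exp(-(h @ W2 + b2)))

        # 2. Loss: mean binary cross-entropy (small epsilon avoids log(0)).
        loss = -np.mean(y * np.log(p + 1e-12) + (1 - y) * np.log(1 - p + 1e-12))
        losses.append(float(loss))

        # 3. Gradient: chain rule applied by hand, output layer first.
        dz2 = (p - y) / len(X)
        dW2 = h.T @ dz2
        db2 = dz2.sum(axis=0, keepdims=True)
        dz1 = (dz2 @ W2.T) * (1.0 - h ** 2)
        dW1 = X.T @ dz1
        db1 = dz1.sum(axis=0, keepdims=True)

        # 4. Update: plain gradient descent step.
        W1 -= lr * dW1
        b1 -= lr * db1
        W2 -= lr * dW2
        b2 -= lr * db2

    # Final predictions with the trained weights.
    h = np.tanh(X @ W1 + b1)
    preds = 1.0 / (1.0 + np.exp(-(h @ W2 + b2)))
    return losses, preds


losses, preds = train_xor()
print(f"loss: {losses[0]:.3f} -> {losses[-1]:.3f}")
print(preds.round(2).ravel())
```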
MODULE 04 PyTorch Environment and Tensor Foundations
PyTorch fundamentals: tensors and their operations, autograd and the computational graph, layers via nn.Module, and a full training cycle with metrics and GPU usage.
MODULE 05 From Neuron to MLP: Architecture and Forward Pass
From a single perceptron to a multilayer MLP: activation functions (sigmoid, ReLU, GELU, tanh), the Universal Approximation Theorem, forward pass mechanics, loss functions (MSE and Cross-Entropy), and implementing a 2-layer network from scratch in pure NumPy.
- 5.1 Perceptron: input, weight, bias, activation
- 5.2 Activation functions (sigmoid, ReLU, GELU, tanh): when and why
- 5.3 The Universal Approximation Theorem: why non-linearity is necessary
- 5.4 Multilayer network (MLP) and the forward pass step by step
- 5.5 Loss functions: MSE and Cross-Entropy, intuition and choice
- 5.6 Implementing a 2-layer MLP from scratch (no autograd, pure NumPy)
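To make topics 5.1 and 5.2 concrete, here is a minimal sketch of a single neuron with the four listed activations, using only the standard library. The function names, inputs and weights are hypothetical values picked for illustration; GELU is written in its exact form x * Phi(x) via the error function.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def relu(z):
    return max(0.0, z)


def gelu(z):
    # Exact GELU: z times the standard normal CDF evaluated at z.
    return 0.5 * z * (1.0 + math.erf(z / math.sqrt(2.0)))


def perceptron(x, w, b, activation):
    """Weighted sum of inputs plus bias, passed through an activation."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return activation(z)


x = [1.0, -2.0]   # inputs (illustrative)
w = [0.5, 0.25]   # weights (illustrative)
b = 0.0           # bias; here the pre-activation z is exactly 0

for name, fn in [("sigmoid", sigmoid), ("relu", relu),
                 ("gelu", gelu), ("tanh", math.tanh)]:
    print(name, perceptron(x, w, b, fn))
```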
MODULE 06 Backpropagation: How a Network Learns
The backpropagation algorithm from its mathematical foundation to practical implementation: the chain rule as the core of backprop, symmetry between forward and backward pass, building a micrograd-style autograd engine (after Karpathy), hand-deriving gradients through cross-entropy, a linear layer, tanh and batch norm, and the impact of Xavier and He initialization on healthy gradient flow.
- 6.1 Chain rule: the foundation of backpropagation
- 6.2 Forward pass vs backward pass: symmetry and gradient flow
- 6.3 Building micrograd: Value, backward(), graph visualization (Karpathy)
- 6.4 Backprop Ninja: manual backward through cross-entropy, linear, tanh and batch-norm
- 6.5 Weight initialization: Xavier and He, how the start decides gradient flow
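The idea behind topic 6.3 can be sketched as a scalar `Value` class that records its parents and a local backward rule, then applies the chain rule in reverse topological order. This is a simplified sketch in the spirit of micrograd, not Karpathy's actual code; it supports only addition, multiplication and tanh, and the attribute names are assumptions.

```python
import math


class Value:
    """A scalar that remembers how it was computed (micrograd-style sketch)."""

    def __init__(self, data, parents=()):
        self.data = float(data)
        self.grad = 0.0
        self._parents = parents
        self._backward = lambda: None  # leaf nodes have nothing to propagate

    def __add__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        out = Value(self.data + other.data, (self, other))

        def _backward():
            # d(a + b)/da = 1 and d(a + b)/db = 1
            self.grad += out.grad
            other.grad += out.grad

        out._backward = _backward
        return out

    def __mul__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        out = Value(self.data * other.data, (self, other))

        def _backward():
            # d(a * b)/da = b and d(a * b)/db = a
            self.grad += other.data * out.grad
            other.grad += self.data * out.grad

        out._backward = _backward
        return out

    def tanh(self):
        t = math.tanh(self.data)
        out = Value(t, (self,))

        def _backward():
            # d tanh(z)/dz = 1 - tanh(z)^2
            self.grad += (1.0 - t * t) * out.grad

        out._backward = _backward
        return out

    def backward(self):
        # Topologically sort the graph, then apply local rules output-first.
        order, seen = [], set()

        def visit(node):
            if node not in seen:
                seen.add(node)
                for parent in node._parents:
                    visit(parent)
                order.append(node)

        visit(self)
        self.grad = 1.0
        for node in reversed(order):
            node._backward()


# One neuron: y = tanh(x * w + b), with illustrative values.
x, w, b = Value(0.5), Value(-2.0), Value(1.0)
y = (x * w + b).tanh()
y.backward()
print(y.data, x.grad, w.grad, b.grad)
```

Here the pre-activation is 0, so the tanh derivative is 1 and the gradients reduce to dy/dx = w, dy/dw = x and dy/db = 1.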
MODULE 07 Training in practice: optimizers and diagnostics
The practical side of training neural networks: geometry of the loss landscape and mini-batch SGD, momentum and Adam (the latter with adaptive per-parameter learning rates), learning rate schedules (step decay, cosine annealing, warmup), systematic training diagnostics (overfit a single batch, sanity-check loss at init), gradient histograms, the dead-neuron problem and gradient clipping, and the classical bias-variance tradeoff as a framework for diagnosing underfitting and overfitting.
- 7.1 Gradient descent geometrically: loss surface, learning rate and mini-batch SGD
- 7.2 Momentum and Adam: adaptive learning rates and when to use them
- 7.3 LR schedules: step decay, cosine annealing, warmup
- 7.4 Systematic diagnostics: overfit single batch, init loss, learning curves
- 7.5 Gradient histograms, dead neurons and gradient clipping
- 7.6 Bias-variance tradeoff and diagnosing underfitting vs overfitting
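One common way to combine two of the schedules from topic 7.3 is linear warmup followed by cosine annealing. The sketch below is framework-free; the function name `lr_at` and all default values are hypothetical, and this is only one of several reasonable formulations.

```python
import math


def lr_at(step, total_steps, base_lr=1e-3, warmup_steps=100, min_lr=0.0):
    """Learning rate at a given step: linear warmup, then cosine decay."""
    if step < warmup_steps:
        # Ramp linearly from base_lr / warmup_steps up to base_lr.
        return base_lr * (step + 1) / warmup_steps
    # Fraction of the post-warmup phase completed, in [0, 1].
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    # Half a cosine wave: base_lr at progress 0, min_lr at progress 1.
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


for step in (0, 99, 100, 550, 1000):
    print(step, lr_at(step, total_steps=1000))
```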
MODULE 08 Regularization: how to avoid overfitting
Regularization as a set of techniques that preserve model generalization: dropout as stochastic neuron suppression with different behavior in train vs eval mode, weight decay and L2 as a penalty for large weights, batch normalization addressing internal covariate shift, layer normalization as an alternative for small batches and variable-length sequences, and early stopping together with systematic training monitoring (loss curves, train/val splits, stopping criteria).
- 8.1 Dropout: mechanism, train vs eval mode, implementation
- 8.2 Weight decay and L2 regularization: penalizing large weights
- 8.3 Batch Normalization: the internal covariate shift problem and its solution
- 8.4 Layer Normalization: when BatchNorm fails and how to replace it
- 8.5 Early stopping and training monitoring strategies
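The train vs eval difference in topic 8.1 can be shown with a small NumPy sketch of inverted dropout: surviving activations are scaled by 1 / (1 - p) during training so that evaluation needs no rescaling. The function signature is an assumption for illustration and requires p < 1.

```python
import numpy as np


def dropout(x, p=0.5, training=True, rng=None):
    """Inverted dropout: zero each element with probability p while training."""
    if not training or p == 0.0:
        return x  # eval mode: identity, no masking and no scaling
    if rng is None:
        rng = np.random.default_rng()
    keep = rng.random(x.shape) >= p  # True with probability 1 - p
    return x * keep / (1.0 - p)      # rescale so the expected value is unchanged


x = np.ones((4, 5))
train_out = dropout(x, p=0.5, training=True, rng=np.random.default_rng(0))
eval_out = dropout(x, p=0.5, training=False)
print(train_out)
print(eval_out)
```

In training mode every entry is either 0 or 2 for this input; in eval mode the input passes through untouched.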
MODULE 09 Convolutional Neural Networks (CNN)
Convolutional networks as the foundation of modern computer vision: 2D convolution with a filter as a feature detector, the role of padding, stride, and translation equivariance; pooling and the flow of spatial dimensions through successive layers; the evolution of architectures from AlexNet through VGG to ResNet and the answer to what changed and why; skip connections and residual blocks that solve the degradation problem in very deep networks (He et al. 2015); transfer learning as feature extraction and fine-tuning of pretrained models.
- 9.1 2D convolution: filter as feature detector, padding, stride, equivariance
- 9.2 Pooling, feature maps, and dimension flow through the network
- 9.3 Architecture evolution: AlexNet → VGG → ResNet, what changed and why
- 9.4 Skip connections and residual blocks: solving the degradation problem (He 2015)
- 9.5 Transfer learning: feature extraction vs fine-tuning (how to leverage ImageNet)
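The dimension flow from topics 9.1 and 9.2 follows the standard output-size formula floor((n + 2 * padding - kernel) / stride) + 1 along each spatial axis. A minimal sketch, with a hypothetical helper name and a ResNet-style stem as the worked example:

```python
def conv_output_size(n, kernel, stride=1, padding=0):
    """Spatial output size of a convolution or pooling layer along one axis."""
    return (n + 2 * padding - kernel) // stride + 1


size = 224                                                     # input height/width
size = conv_output_size(size, kernel=7, stride=2, padding=3)   # 7x7 conv, stride 2
print(size)  # 112
size = conv_output_size(size, kernel=3, stride=2, padding=1)   # 3x3 max pool, stride 2
print(size)  # 56
size = conv_output_size(size, kernel=3, stride=1, padding=1)   # 3x3 conv, "same" padding
print(size)  # 56
```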
MODULE 10 Interpretation and Visualization of Neural Networks
How to open the black box of a deep network: visualization of learned filters and activation maps in a CNN (Zeiler & Fergus 2014); Grad-CAM as a gradient-weighted class saliency map (Selvaraju et al. 2017); adversarial examples and FGSM as evidence of how fragile model decisions can be (Goodfellow et al. 2015); model profiling with parameter count, FLOPs, and inference latency as concrete computational cost metrics.
MODULE 11 Sequences: RNN, LSTM and GRU
Why feedforward networks are insufficient for sequential data and how recurrence solves this problem. The classical RNN and its training via BPTT (backpropagation through time, Werbos 1990). Gradient pathologies in long time unrollings: vanishing and exploding gradients (Bengio et al. 1994). LSTM as the answer to vanishing gradients with forget, input and output gates (Hochreiter & Schmidhuber 1997). GRU as a simplified LSTM alternative with fewer gates (Cho et al. 2014).
MODULE 12 The Attention Mechanism and the Transformer
The attention mechanism is the invention that replaced recurrence as the foundation of sequence modeling and gave rise to the Transformer architecture (Vaswani et al. 2017). The chapter covers the motivation (RNN limitations on long-range dependencies: vanishing gradients, lack of parallelism), followed by scaled dot-product attention with the Query/Key/Value triple, multi-head attention and positional encoding, the full encoder block (FFN, residual, Layer Norm), BPE tokenization, and a mini-Transformer implementation from scratch in PyTorch.
- 12.1 Motivation: RNN limitations and long-range dependencies
- 12.2 Self-attention: Query, Key, Value and scaled dot-product attention
- 12.3 Multi-head attention and positional encoding
- 12.4 Transformer architecture: encoder block, FFN, LayerNorm, residual
- 12.5 Tokenization and BPE: why text is neither characters nor words
- 12.6 Implementing a mini-Transformer from scratch in PyTorch
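Scaled dot-product attention from topic 12.2 computes softmax(Q K^T / sqrt(d_k)) V. The sketch below uses NumPy rather than PyTorch to stay dependency-light; it omits masking, batching and multiple heads, and the tensor shapes are arbitrary values chosen for illustration.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def scaled_dot_product_attention(Q, K, V):
    """Return the attention output and the attention weights."""
    d_k = Q.shape[-1]
    # Similarity of every query to every key, scaled to keep softmax well-behaved.
    scores = Q @ K.swapaxes(-1, -2) / np.sqrt(d_k)
    weights = softmax(scores, axis=-1)  # each row is a distribution over keys
    return weights @ V, weights


rng = np.random.default_rng(0)
Q = rng.normal(size=(3, 4))  # 3 queries of dimension d_k = 4
K = rng.normal(size=(5, 4))  # 5 keys of dimension d_k = 4
V = rng.normal(size=(5, 6))  # 5 values of dimension 6
out, weights = scaled_dot_product_attention(Q, K, V)
print(out.shape, weights.shape)
```

Each query produces one output row that is a weighted average of the value vectors, with weights summing to 1.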