Robots Atlas

AI Architecture · Advanced

Transformer from Scratch

10 Chapters · 40 Lessons

This course will guide learners through a practical Transformer implementation in PyTorch: from token representations and attention to a full trainable model. Chapters and lessons will be added in later stages.

Chapters

MODULE 01

Transformer Foundations

You will learn why the Transformer architecture was created, the core concepts of sequences and tokens, and the differences between encoders, decoders and decoder-only models.

  1. 1.1 Why the Transformer Was Created
  2. 1.2 Sequences, Tokens and Representations
  3. 1.3 Encoder, Decoder and Decoder-Only Models
  4. 1.4 Data Flow Through a Transformer
MODULE 02

PyTorch for Sequence Models

You will learn the practical PyTorch foundations needed to implement a Transformer: tensor shapes, broadcasting, axis transforms, modules, masks, padding and GPU execution.

  1. 2.1 3D Tensors: Batch, Sequence, Features
  2. 2.2 Broadcasting, Reshape, Transpose and View
  3. 2.3 `nn.Module` and Building Model Blocks
  4. 2.4 Masks, Padding and GPU Operations
MODULE 03

Self-Attention from Scratch

You will build intuition for self-attention, learn the roles of Query, Key and Value, derive scaled dot-product attention, and prepare to implement a single attention head in PyTorch.

  1. 3.1 Intuition Behind Attention
  2. 3.2 Query, Key, Value
  3. 3.3 Scaled Dot-Product Attention
  4. 3.4 Implementing a Single Attention Head
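
As a preview of where this module is heading, scaled dot-product attention can be sketched in a few lines. The version below is deliberately framework-free (plain Python lists rather than PyTorch tensors), and the function names `softmax` and `scaled_dot_product_attention` are illustrative choices, not names taken from the course material. It is a minimal sketch with no masking and no batching.

```python
import math

def softmax(row):
    # Subtract the maximum before exponentiating for numerical stability.
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def scaled_dot_product_attention(q, k, v):
    """Compute softmax(Q K^T / sqrt(d)) V on nested lists.

    q: T_q x d, k: T_k x d, v: T_k x d_v.
    Returns (outputs, weights): T_q x d_v and T_q x T_k.
    """
    d = len(q[0])
    outputs, weights = [], []
    for q_i in q:
        # Similarity of this query to every key, scaled by sqrt(d).
        scores = [sum(a * b for a, b in zip(q_i, k_j)) / math.sqrt(d) for k_j in k]
        w = softmax(scores)
        weights.append(w)
        # Each output is a weighted average of the value vectors.
        outputs.append([sum(w_j * v_j[c] for w_j, v_j in zip(w, v))
                        for c in range(len(v[0]))])
    return outputs, weights
```

Each query row produces one probability distribution over the keys, and the corresponding output is the average of the value vectors under that distribution.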
MODULE 04

Multi-Head Attention

You will learn why the Transformer uses multiple attention heads, how Q, K and V projections work across heads, how to merge head outputs, and how to build a MultiHeadAttention module in PyTorch.

  1. 4.1 Why Multiple Attention Heads Matter
  2. 4.2 Linear Projections for Q, K and V
  3. 4.3 Merging Heads and Output Projection
  4. 4.4 Implementing `MultiHeadAttention` in PyTorch
MODULE 05

Transformer Block

You will assemble a complete Transformer block from attention, residual connections, LayerNorm and the feed-forward network, following a stable implementation pattern.

  1. 5.1 Residual Connections
  2. 5.2 LayerNorm: Pre-Norm and Post-Norm
  3. 5.3 Feed Forward Network
  4. 5.4 Complete Transformer Block
MODULE 06

Embeddings and Token Position

You will learn how token IDs become vectors, how position information is added, and which masks are needed for padding and autoregressive decoding.

  1. 6.1 Token Embeddings
  2. 6.2 Sinusoidal Positional Encoding
  3. 6.3 Learned Positional Embeddings
  4. 6.4 Padding Mask and Causal Mask
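
To make the two masks named in this module concrete, here is a small framework-free sketch in plain Python. It assumes a padding token ID of `0` and uses the convention that `True` means "attention allowed"; both are assumptions made for illustration, and the helper names are hypothetical. A PyTorch implementation would typically build the same patterns as boolean tensors.

```python
def causal_mask(seq_len):
    # Position i may attend to position j only if j <= i (no looking ahead).
    return [[j <= i for j in range(seq_len)] for i in range(seq_len)]

def padding_mask(token_ids, pad_id=0):
    # True for real tokens, False for padding positions.
    # pad_id=0 is an assumption made for this sketch.
    return [tok != pad_id for tok in token_ids]

def combined_mask(token_ids, pad_id=0):
    # A key position is visible only if it is not in the future
    # and is not a padding token.
    n = len(token_ids)
    keep = padding_mask(token_ids, pad_id)
    causal = causal_mask(n)
    return [[causal[i][j] and keep[j] for j in range(n)] for i in range(n)]
```

For a sequence such as `[5, 7, 0]`, the last column of the combined mask is entirely `False` because that position holds padding, while the upper triangle is `False` because of the causal constraint.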
MODULE 07

Decoder-Only Transformer

You will assemble a mini-GPT from embeddings, a decoder-only block stack, a language modeling head, and a full forward pass returning logits and loss.

  1. 7.1 Mini-GPT Architecture
  2. 7.2 Stacking Transformer Blocks
  3. 7.3 Language Modeling Head and Logits
  4. 7.4 Full Model Forward Pass
MODULE 08

Training a Language Model

You will move from preparing training sequences through the loss function and PyTorch training loop to validation, checkpoints, and core language-model metrics.

  1. 8.1 Training Data and Sequence Batching
  2. 8.2 Cross-Entropy Loss for Next-Token Prediction
  3. 8.3 Training Loop in PyTorch
  4. 8.4 Validation, Checkpoints and Metrics
MODULE 09

Text Generation

This chapter shows how to run a language model in generation mode: from the autoregressive loop through sampling, KV cache, and quality debugging.

  1. 9.1 Autoregressive Token Generation
  2. 9.2 Temperature, Top-K and Top-P Sampling
  3. 9.3 KV Cache: Intuition and Implementation
  4. 9.4 Debugging Generation Quality
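
The sampling step of the generation loop can be illustrated with a short sketch. The function below applies temperature scaling and optional top-k filtering to a list of logits and then draws one token index. It is framework-free on purpose, the name `sample_next_token` and its parameters are hypothetical, and top-p sampling is left out to keep the example short.

```python
import math
import random

def sample_next_token(logits, temperature=1.0, top_k=None, rng=random):
    """Draw one token index from logits using temperature and top-k.

    rng may be the random module or a random.Random instance.
    """
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    if top_k is not None and top_k < 1:
        raise ValueError("top_k must be at least 1")
    # Lower temperature sharpens the distribution, higher flattens it.
    scaled = [x / temperature for x in logits]
    # Rank token indices from most to least likely.
    indices = sorted(range(len(scaled)), key=lambda i: scaled[i], reverse=True)
    if top_k is not None:
        indices = indices[:top_k]  # keep only the k best candidates
    # Unnormalized softmax weights; random.choices normalizes internally.
    m = scaled[indices[0]]
    weights = [math.exp(scaled[i] - m) for i in indices]
    return rng.choices(indices, weights=weights, k=1)[0]
```

With `top_k=1` the function reduces to greedy decoding, since only the highest-scoring token remains as a candidate.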
MODULE 10

Optimizations and Modern Variants

This chapter covers practical extensions of the classic Transformer: RoPE, FlashAttention, MQA/GQA, and techniques for fine-tuning and scaling modern models.

  1. 10.1 RoPE Instead of Classic Positional Embeddings
  2. 10.2 FlashAttention and Attention Performance
  3. 10.3 MQA, GQA and Lower Inference Cost
  4. 10.4 LoRA, MoE and Future Directions