Neural Networks: From Fundamentals to Modern AI · Sequences: RNN, LSTM and GRU

RNN: hidden-state loop, BPTT and a first-step implementation

Introduction

A recurrent neural network (Elman 1990, Jordan 1986) adds one element to the feedforward setup: a hidden state vector h_t that is updated recursively. The classical formula is h_t = tanh(W_xh x_t + W_hh h_{t-1} + b_h), with output y_t = W_hy h_t + b_y. This seemingly minimal change has three consequences: (1) parameter sharing across time: the same matrices W_xh, W_hh, W_hy are used at every step; (2) memory: h_t carries a summary of the history x_1..x_t; (3) sequential computation: h_t depends on h_{t-1}, so the forward pass cannot be parallelized across time steps (unlike a Transformer, which processes all positions of a training sequence in parallel).

Training uses Backpropagation Through Time (BPTT, Werbos 1990): we unroll the network in time into a computational graph with T steps and propagate the gradient from the last step back through all T-1 earlier ones. Practical implementations use truncated BPTT (k1, k2; Mikolov 2010): every k1 steps the gradient is propagated only k2 steps back, so that thousand-step sequences fit in memory.

The lesson also covers Elman vs Jordan networks, the char-RNN archetype (Karpathy 2015, "The Unreasonable Effectiveness of Recurrent Neural Networks") and the basic variants: many-to-one (classification), synchronous many-to-many (POS tagging), and seq2seq (encoder-decoder, Sutskever et al. 2014).
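The recurrence and full BPTT described above can be sketched in NumPy as follows. This is a minimal illustration, not a reference implementation: the function names (rnn_forward, rnn_loss_and_grads), the parameter dictionary layout, the toy dimensions, and the squared-error loss summed over all steps are assumptions made for the example.

```python
import numpy as np

def rnn_forward(xs, h0, p):
    """Elman recurrence over xs of shape (T, n_in); returns all states and outputs."""
    hs, ys = [h0], []
    for x_t in xs:  # sequential by construction: h_t needs h_{t-1}
        h_t = np.tanh(p["W_xh"] @ x_t + p["W_hh"] @ hs[-1] + p["b_h"])
        hs.append(h_t)
        ys.append(p["W_hy"] @ h_t + p["b_y"])
    return np.stack(hs), np.stack(ys)  # hs[0] is h0, so hs has T+1 rows

def rnn_loss_and_grads(xs, targets, h0, p):
    """Full BPTT for the (assumed) loss 0.5 * sum_t ||y_t - target_t||^2."""
    hs, ys = rnn_forward(xs, h0, p)
    loss = 0.5 * np.sum((ys - targets) ** 2)
    g = {name: np.zeros_like(value) for name, value in p.items()}
    dh_next = np.zeros_like(h0)  # gradient arriving from step t+1
    for t in reversed(range(len(xs))):
        dy = ys[t] - targets[t]
        g["W_hy"] += np.outer(dy, hs[t + 1])
        g["b_y"] += dy
        dh = p["W_hy"].T @ dy + dh_next    # from the output and from the future
        dz = (1.0 - hs[t + 1] ** 2) * dh   # back through tanh
        g["W_xh"] += np.outer(dz, xs[t])   # shared weights: gradients accumulate
        g["W_hh"] += np.outer(dz, hs[t])
        g["b_h"] += dz
        dh_next = p["W_hh"].T @ dz         # pass the gradient on to step t-1
    return loss, g

# Toy setup (all sizes are arbitrary choices for the example).
rng = np.random.default_rng(0)
T, n_in, n_h, n_out = 6, 3, 4, 2
params = {
    "W_xh": rng.normal(scale=0.5, size=(n_h, n_in)),
    "W_hh": rng.normal(scale=0.5, size=(n_h, n_h)),
    "W_hy": rng.normal(scale=0.5, size=(n_out, n_h)),
    "b_h": np.zeros(n_h),
    "b_y": np.zeros(n_out),
}
xs = rng.normal(size=(T, n_in))
targets = rng.normal(size=(T, n_out))
h0 = np.zeros(n_h)

hs, ys = rnn_forward(xs, h0, params)
loss, grads = rnn_loss_and_grads(xs, targets, h0, params)
print(hs.shape, ys.shape)  # (7, 4) (6, 2)
```

Note how the same three matrices appear in every iteration of both loops, which is the parameter sharing from point (1), and how the backward loop walks the unrolled graph from the last step to the first. A truncated variant would, roughly speaking, stop carrying dh_next back after a fixed number of steps instead of traversing the whole sequence.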