Neural Networks: From Fundamentals to Modern AI · Sequences: RNN, LSTM and GRU
The sequence problem — why feedforward is not enough
Introduction
A classical feedforward network (an MLP or a non-recurrent CNN) assumes a fixed-shape input and processes it in a single pass, with no memory between samples. Sequences such as text, audio, time series, and DNA break both assumptions: their length varies, and the meaning of the element at position t depends on the history at positions 1..t-1.
Forcing a sequence into a classical network leads to three workarounds: (1) a fixed-length window of N elements, as in the NNLM of Bengio et al. (2003), which predicts a word from the N-1 previous ones; (2) bag-of-words or averaging, which discards order; (3) padding or truncation to a maximum length, which spends parameters on positions that are mostly padding and cannot handle inputs longer than the chosen maximum. None of the three gives order-sensitive processing with parameters shared across time. In a feedforward network over a window or a padded input, the weight for position 3 is distinct from the weight for position 7, so the model must learn the same pattern separately for each slot, and rare events contribute too few examples per position to be learned well.
Recurrent networks (RNN, LSTM, GRU) address this by sharing parameters across time: the same matrix W is applied at every step, and the hidden state h_t carries a summary of the history. The lesson also outlines non-recurrent alternatives: 1D-CNNs with dilated convolutions (WaveNet, van den Oord et al. 2016) and the Transformer with self-attention (Vaswani et al. 2017). To grasp the core sequence problem, however, you must first see why a plain MLP fails.
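The contrast between these approaches can be made concrete with a small sketch. The code below is an illustration under stated assumptions, not a reference implementation: it uses a plain tanh recurrence h_t = tanh(W_xh x_t + W_hh h_{t-1} + b), splits the shared weights into two hypothetical matrices `W_xh` and `W_hh`, and uses toy dimensions and random weights. All function names (`rnn_encode`, `mean_pool`, `window_mlp_param_count`, and so on) are invented for this example.

```python
import math
import random


def make_matrix(rows, cols, rng):
    """Small random matrix as a list of rows (illustrative initialization)."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


def matvec(m, v):
    """Matrix-vector product for list-of-rows matrices."""
    return [sum(w * x for w, x in zip(row, v)) for row in m]


def rnn_encode(xs, W_xh, W_hh, b):
    """Apply h_t = tanh(W_xh x_t + W_hh h_{t-1} + b) over a sequence of any length."""
    h = [0.0] * len(b)
    for x in xs:  # the same W_xh, W_hh and b are reused at every time step
        a = matvec(W_xh, x)
        r = matvec(W_hh, h)
        h = [math.tanh(ai + ri + bi) for ai, ri, bi in zip(a, r, b)]
    return h


def mean_pool(xs):
    """Bag-of-words style summary: average the vectors, ignoring their order."""
    return [sum(col) / len(xs) for col in zip(*xs)]


def rnn_param_count(input_dim, hidden_dim):
    """Parameters of the recurrence; independent of sequence length."""
    return hidden_dim * input_dim + hidden_dim * hidden_dim + hidden_dim


def window_mlp_param_count(input_dim, hidden_dim, window):
    """First layer of an MLP over `window` concatenated inputs:
    one separate weight block per position, plus a bias."""
    return hidden_dim * input_dim * window + hidden_dim


rng = random.Random(0)
input_dim, hidden_dim = 3, 4
W_xh = make_matrix(hidden_dim, input_dim, rng)
W_hh = make_matrix(hidden_dim, hidden_dim, rng)
b = [0.0] * hidden_dim

# Variable length: the same weights handle 2 steps or 50 steps.
short_seq = [[rng.uniform(-1, 1) for _ in range(input_dim)] for _ in range(2)]
long_seq = [[rng.uniform(-1, 1) for _ in range(input_dim)] for _ in range(50)]
h_short = rnn_encode(short_seq, W_xh, W_hh, b)
h_long = rnn_encode(long_seq, W_xh, W_hh, b)
print(len(h_short), len(h_long))  # 4 4

# Order: averaging cannot tell a sequence from its reverse, the recurrence can.
tokens = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]  # one-hot "words"
print(mean_pool(tokens) == mean_pool(tokens[::-1]))  # True
print(rnn_encode(tokens, W_xh, W_hh, b) == rnn_encode(tokens[::-1], W_xh, W_hh, b))

# Parameter sharing: the windowed MLP grows with the window, the RNN does not.
print(rnn_param_count(input_dim, hidden_dim))             # 32
print(window_mlp_param_count(input_dim, hidden_dim, 50))  # 604
```

In this sketch the recurrence keeps 32 parameters whether the input has 2 or 50 steps, while the first layer of a 50-slot windowed MLP already needs 604, each tied to one specific position. The averaging baseline shares parameters but returns the same summary for a sequence and its reverse.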