Neural Networks: From Fundamentals to Modern AI · Sequences: RNN, LSTM and GRU

LSTM: gates, cell state and the constant error carousel

Introduction

LSTM (Long Short-Term Memory) is an architecture introduced by Sepp Hochreiter and Jürgen Schmidhuber in 1997 as a direct remedy for vanishing gradients in classical RNNs. The key idea: instead of a single hidden state h_t, carry two parallel signals, a cell state c_t (long-term memory, the "tape") and a hidden state h_t (short-term memory, which is also the output passed to the next layer).

The cell state is updated additively, c_t = f_t ⊙ c_{t-1} + i_t ⊙ g_t, rather than by repeated multiplication with a weight matrix followed by a squashing nonlinearity. This is the heart of LSTM, the so-called constant error carousel (CEC): when the forget gate f_t ≈ 1, c_t ≈ c_{t-1} + i_t ⊙ g_t, so along the direct cell-to-cell path ∂c_t/∂c_{t-1} = diag(f_t) ≈ I, and gradients flow backwards without exponential decay.

Three sigmoid gates, each with outputs in (0, 1), control the flow: the forget gate f_t decides what to keep or forget from c_{t-1}, the input gate i_t decides how much of the candidate g_t = tanh(...) to add, and the output gate o_t decides what to emit as h_t = o_t ⊙ tanh(c_t). The forget gate was added by Gers, Schmidhuber and Cummins in 2000 ("Learning to forget"); the original 1997 paper did not have it, which amounts to fixing f_t = 1.

The parameter count is 4× that of a plain RNN of the same width (four weight blocks, one each for f, i, g and o), which means roughly four times the memory and compute per time step. LSTM was the standard in NLP and sequence modeling from 2014 (seq2seq, Sutskever et al.) until 2017-2018, when the Transformer dethroned it, but it is still used in domains where sequences are short and resources are limited (speech, time series, edge devices).
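The update rule, the three gates and the 4× parameter count described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not a reference implementation: the function names (`lstm_step`, `lstm_param_count`, `rnn_param_count`), the stacked weight layout in the order f, i, g, o, and the forget-gate bias of 5.0 are all choices made here for the demonstration. The "carousel" product tracks only the direct cell-to-cell path, i.e. the product of forget gates, and ignores the indirect dependence of the gates on h_{t-1}.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM time step.

    Assumed layout (hypothetical, for illustration): W has shape (4*H, D+H)
    and b has shape (4*H,), with row blocks stacked in the order f, i, g, o.
    """
    H = h_prev.shape[0]
    z = W @ np.concatenate([x_t, h_prev]) + b
    f = sigmoid(z[0:H])          # forget gate
    i = sigmoid(z[H:2 * H])      # input gate
    g = np.tanh(z[2 * H:3 * H])  # candidate values
    o = sigmoid(z[3 * H:4 * H])  # output gate
    c_t = f * c_prev + i * g     # additive cell-state update
    h_t = o * np.tanh(c_t)       # hidden state / output
    return h_t, c_t, f

def rnn_param_count(D, H):
    # Plain RNN: one weight block of shape (H, D+H) plus one bias vector.
    return H * (D + H) + H

def lstm_param_count(D, H):
    # LSTM: four such blocks, one each for f, i, g and o.
    return 4 * rnn_param_count(D, H)

rng = np.random.default_rng(0)
D, H, T = 3, 5, 50  # input size, hidden size, sequence length (arbitrary)

W = rng.normal(scale=0.1, size=(4 * H, D + H))
b = np.zeros(4 * H)
b[0:H] = 5.0  # assumption: large forget bias so that f_t stays close to 1

h = np.zeros(H)
c = np.zeros(H)
carousel = np.ones(H)  # running product of forget gates along the direct path
for t in range(T):
    x = rng.normal(size=D)
    h, c, f = lstm_step(x, h, c, W, b)
    carousel *= f

print("direct-path gradient factor after", T, "steps:", carousel.min())
print("same number of steps with a factor of 0.5 each:", 0.5 ** T)
print("LSTM / RNN parameter ratio:", lstm_param_count(D, H) / rnn_param_count(D, H))  # 4.0
```

With forget gates near 1, the product over 50 steps stays at a usable magnitude, whereas a per-step factor of 0.5 (standing in for a shrinking recurrent Jacobian in a plain RNN) collapses to roughly 1e-15. Lowering the forget bias in the sketch shows how quickly the carousel loses that property when f_t drifts away from 1.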