Neural Networks: From Fundamentals to Modern AI · Sequences: RNN, LSTM and GRU

GRU: the simplified LSTM variant

Introduction

GRU (Gated Recurrent Unit) is an architecture proposed by Kyunghyun Cho et al. in 2014 ("Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation") as a simplification of LSTM. The core idea: instead of three gates plus a separate cell state, a GRU has two gates (reset r_t and update z_t) and a single hidden state h_t that serves as both memory and output.

Both gates are sigmoid functions of the previous state and the current input: z_t = σ(W_z·[h_{t-1}, x_t] + b_z) and r_t = σ(W_r·[h_{t-1}, x_t] + b_r). The update gate z_t acts as a "coupled" forget+input gate: h_t = (1-z_t) ⊙ h_{t-1} + z_t ⊙ h̃_t. When z_t=0 the old state is kept; when z_t=1 it is replaced by the new candidate. (Some papers and libraries swap the roles of z_t and 1-z_t; the two conventions are equivalent.) The reset gate r_t controls how much of the previous state influences the candidate: h̃_t = tanh(W_h·[r_t ⊙ h_{t-1}, x_t] + b_h). For r_t=0 the candidate ignores history (useful when a new sequence starts); for r_t=1 it incorporates history normally.

Gain: three weight matrices instead of four (about 25% fewer parameters than an LSTM of the same size), no separate cell state (less state to propagate), and a simpler implementation. Cost: no output gate (all of h_t is exposed at every step), no peephole connections, and limited expressivity on some tasks: Weiss, Goldberg and Yahav (2018) showed that LSTM outperforms GRU on counting tasks.

Empirically, Chung et al. (2014, "Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling") showed that LSTM and GRU give comparable results on most tasks. GRU wins on training speed and memory footprint; LSTM wins when the task requires precise control over what is emitted (selective output). In practice, GRU tends to be preferred in smaller models (mobile, IoT, streaming) and LSTM in larger encoders.
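
To make the update rule above concrete, here is a minimal sketch of a single GRU step in plain Python, using the convention h_t = (1-z_t) ⊙ h_{t-1} + z_t ⊙ h̃_t from this section. It is an illustration under stated assumptions, not a reference implementation: the helper names (sigmoid, affine, gru_step, init_params, count_params), the layer sizes, and the random initialization range are all hypothetical choices made for the example, and vectors are plain lists for readability rather than speed.

```python
import math
import random


def sigmoid(v):
    """Element-wise logistic function over a list of floats."""
    return [1.0 / (1.0 + math.exp(-a)) for a in v]


def affine(W, v, b):
    """Compute W.v + b, with W given as a list of rows."""
    return [sum(w * a for w, a in zip(row, v)) + b_i for row, b_i in zip(W, b)]


def gru_step(x_t, h_prev, params):
    """One GRU step: h_t = (1 - z_t) * h_prev + z_t * h_cand."""
    W_z, b_z, W_r, b_r, W_h, b_h = params
    concat = h_prev + x_t                       # [h_{t-1}, x_t]
    z = sigmoid(affine(W_z, concat, b_z))       # update gate z_t
    r = sigmoid(affine(W_r, concat, b_r))       # reset gate r_t
    # The reset gate scales the previous state before it enters the candidate.
    gated = [r_i * h_i for r_i, h_i in zip(r, h_prev)] + x_t
    h_cand = [math.tanh(a) for a in affine(W_h, gated, b_h)]
    # z = 0 keeps the old state, z = 1 takes the candidate.
    return [(1.0 - z_i) * h_i + z_i * c_i
            for z_i, h_i, c_i in zip(z, h_prev, h_cand)]


def init_params(input_size, hidden_size, seed=0):
    """Small random weights and zero biases (an arbitrary choice for the demo)."""
    rng = random.Random(seed)

    def matrix():
        return [[rng.uniform(-0.1, 0.1) for _ in range(hidden_size + input_size)]
                for _ in range(hidden_size)]

    def bias():
        return [0.0] * hidden_size

    return (matrix(), bias(), matrix(), bias(), matrix(), bias())


def count_params(input_size, hidden_size, n_blocks):
    """Parameters in n_blocks affine maps from [h, x] to h (3 for GRU, 4 for LSTM)."""
    return n_blocks * (hidden_size * (hidden_size + input_size) + hidden_size)


# Run two time steps over a toy input sequence.
params = init_params(input_size=3, hidden_size=4)
h = [0.0] * 4
for x_t in ([1.0, 0.0, 0.0], [0.0, 1.0, 0.0]):
    h = gru_step(x_t, h, params)
print(h)

# GRU-to-LSTM parameter ratio for the same sizes.
print(count_params(3, 4, 3) / count_params(3, 4, 4))  # 0.75
```

The last line reproduces the "25% fewer parameters" figure: with one weight block per gate plus one for the candidate, a GRU has three blocks where an LSTM of the same size has four. Real libraries may lay out weights and biases differently (for example, separate input and recurrent biases), so exact counts can deviate slightly from this ratio.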