Neural Networks: From Fundamentals to Modern AI · The Attention Mechanism and the Transformer

Motivation — RNN limitations and long-range dependencies

Introduction

Before the Transformer (Vaswani et al. 2017), the standard approach to sequence modeling was recurrent networks, namely LSTM (Hochreiter & Schmidhuber 1997) and GRU (Cho et al. 2014), usually in an encoder-decoder architecture with an attention mechanism (Bahdanau et al. 2014, Luong et al. 2015). Despite their successes in neural machine translation (NMT), automatic speech recognition (ASR), and language modeling, RNNs had three structural limitations.

First, sequential computation: the state h_t depends on h_{t-1}, so the time steps of a single sequence cannot be computed in parallel. As a consequence, GPUs sit underutilized, and the number of sequential operations per sequence grows linearly with the sequence length n.

Second, gradient pathology on long BPTT (backpropagation through time) unrollings, that is, vanishing and exploding gradients (Bengio et al. 1994): on its way back the error signal is multiplied by n Jacobian matrices, so it shrinks or grows exponentially in n. LSTM mitigates the vanishing case with gates and a cell-state highway c_t but does not eliminate it.

Third, the fixed context vector bottleneck in classical seq2seq: the encoder squeezes the whole sentence into a single vector h_n, and the decoder has to reconstruct the translation from that vector alone.

Bahdanau attention (2014) was the workaround: at every decoding step the decoder looks at all encoder states h_1..h_n and takes a weighted combination of them. That was the birth of attention in deep learning: initially an add-on to RNNs, and only in 2017 the sole mechanism ("Attention Is All You Need").

The path length of a signal between two distant tokens in an RNN is O(n); in self-attention it is O(1), because every pair of tokens is connected in a single step. This is the key intuition for the whole chapter.
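The contrast can be made concrete with a small sketch in plain Python. It is a toy under stated assumptions, not any paper's implementation: the RNN is scalar (one hidden unit, h_t = tanh(w_h * h_{t-1} + w_x * x_t)), the weights w_h and w_x are arbitrary illustrative values, and the attention scores use a plain dot product (closer to Luong-style scoring) rather than Bahdanau's additive scoring network. All function names are hypothetical.

```python
import math


def rnn_states(inputs, w_h=0.5, w_x=1.0):
    """Scalar toy RNN: h_t = tanh(w_h * h_{t-1} + w_x * x_t), with h_0 = 0."""
    h = 0.0
    states = []
    for x in inputs:  # inherently sequential: step t needs the result of step t-1
        h = math.tanh(w_h * h + w_x * x)
        states.append(h)
    return states


def backprop_factor(states, w_h=0.5):
    """d h_n / d h_0 for the toy RNN: the product of n per-step derivatives.

    Each step contributes w_h * (1 - h_t^2), the scalar analogue of a Jacobian.
    """
    factor = 1.0
    for h in states:
        factor *= w_h * (1.0 - h * h)
    return factor


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_context(query, encoder_states):
    """Weighted combination of ALL encoder states for one decoder query.

    Assumption: dot-product scoring, chosen for brevity.
    """
    scores = [sum(q * h for q, h in zip(query, state)) for state in encoder_states]
    weights = softmax(scores)
    dim = len(encoder_states[0])
    context = [
        sum(w * state[i] for w, state in zip(weights, encoder_states))
        for i in range(dim)
    ]
    return weights, context


if __name__ == "__main__":
    states = rnn_states([0.1] * 50)
    print("gradient factor after 50 steps:", backprop_factor(states))

    weights, context = attention_context(
        query=[1.0, 0.0],
        encoder_states=[[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]],
    )
    print("attention weights:", weights)
    print("context vector:", context)
```

In the recurrent half, each per-step derivative has magnitude at most |w_h| = 0.5, so the gradient factor is bounded by 0.5^n and fades exponentially with sequence length. In the attention half, every encoder state contributes to the context directly through its weight, with no chain of intermediate steps between a distant state and the decoder.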