
Recurrent Neural Network

Historical
Category
Architecture
Abstraction level
Primitive
Operation level
Model, Inference, Training
Use cases
Natural-language modeling (pre-Transformer era)
Machine translation (seq2seq architecture)
Speech recognition (ASR)
Time-series analysis and forecasting
Text and music generation
Sequence classification (sentiment, intent)

How it works

1. At each time step t, the network receives two inputs: the current input vector x_t and the previous hidden state h_{t-1}.
2. It computes the new state: h_t = tanh(W_xh · x_t + W_hh · h_{t-1} + b_h). The same weight matrices W_xh and W_hh are shared across all time steps.
3. Optional per-step output: y_t = W_hy · h_t + b_y (e.g. logits that a softmax turns into next-token probabilities).
4. Training uses BPTT (backpropagation through time): the network is unrolled for T steps and gradients are propagated backward through the unrolled graph. Because the weights are shared, the loss gradient w.r.t. each parameter is the sum of its contributions from all steps.
5. Truncated BPTT limits the propagation horizon to k steps to reduce cost.
6. Vanishing-gradient problem: over long sequences the gradient is multiplied at every step by the recurrent Jacobian (W_hh scaled by the tanh derivative), up to T times in total. Its norm therefore decays or grows exponentially, which makes long-range dependencies hard to learn.
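Steps 1 to 3 can be sketched in a few lines of NumPy. This is a minimal illustration rather than a reference implementation: the helper names `init_params` and `rnn_forward`, the dimensions, and the 0.1 initialization scale are all assumptions chosen for the example.

```python
import numpy as np

def init_params(input_dim, hidden_dim, output_dim, seed=0):
    """Small random weights; the 0.1 scale is an arbitrary illustrative choice."""
    rng = np.random.default_rng(seed)
    return {
        "W_xh": rng.normal(0.0, 0.1, (hidden_dim, input_dim)),
        "W_hh": rng.normal(0.0, 0.1, (hidden_dim, hidden_dim)),
        "b_h": np.zeros(hidden_dim),
        "W_hy": rng.normal(0.0, 0.1, (output_dim, hidden_dim)),
        "b_y": np.zeros(output_dim),
    }

def rnn_forward(params, xs, h0=None):
    """Run the recurrence over xs of shape (T, input_dim).

    Returns hidden states of shape (T, hidden_dim) and
    per-step outputs of shape (T, output_dim).
    """
    hidden_dim = params["W_hh"].shape[0]
    h = np.zeros(hidden_dim) if h0 is None else h0
    hs, ys = [], []
    for x_t in xs:  # strictly sequential: h_t depends on h_{t-1}
        h = np.tanh(params["W_xh"] @ x_t + params["W_hh"] @ h + params["b_h"])
        hs.append(h)
        ys.append(params["W_hy"] @ h + params["b_y"])
    return np.stack(hs), np.stack(ys)

# The same parameters handle sequences of different lengths.
params = init_params(input_dim=3, hidden_dim=4, output_dim=2)
short_seq = np.ones((5, 3))
long_seq = np.ones((12, 3))
hs_short, ys_short = rnn_forward(params, short_seq)
hs_long, ys_long = rnn_forward(params, long_seq)
print(hs_short.shape, ys_short.shape)  # (5, 4) (5, 2)
print(hs_long.shape, ys_long.shape)    # (12, 4) (12, 2)
```

Note that the parameter shapes never mention the sequence length T: only the loop length changes between the two calls.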

Problem solved

Standard feed-forward networks (MLPs, CNNs) process fixed-size inputs and have no built-in memory, so they cannot carry information from one element of a sequence (e.g. words in a sentence, audio samples) to the next. An RNN addresses this by introducing a hidden state carried across time steps, which enables modeling of variable-length sequences and temporal dependencies.

Key mechanisms

Hidden state carried across time steps
Weights shared across time: the same matrices at every step
Update: h_t = tanh(W_xh · x_t + W_hh · h_{t-1} + b_h)
Backpropagation through time (BPTT): unrolling the network in time
Truncated BPTT: limiting the propagation horizon to a window
Topology variants: one-to-many, many-to-one, many-to-many, encoder-decoder
Gradient clipping as the standard training-stabilization technique
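The BPTT and truncated-BPTT mechanisms listed above can be sketched for the tanh update as follows. The setup is an assumption made for brevity: a many-to-one loss (squared error on the final hidden state only), hypothetical function names `forward`, `loss_fn` and `bptt`, and a truncation parameter `k` that simply stops the backward pass after k steps.

```python
import numpy as np

def forward(W_xh, W_hh, b_h, xs):
    """Return hidden states h_0..h_T (h_0 is the zero initial state)."""
    hs = [np.zeros(W_hh.shape[0])]
    for x_t in xs:
        hs.append(np.tanh(W_xh @ x_t + W_hh @ hs[-1] + b_h))
    return hs

def loss_fn(W_xh, W_hh, b_h, xs, target):
    """Illustrative many-to-one loss: squared error on the final hidden state."""
    h_T = forward(W_xh, W_hh, b_h, xs)[-1]
    return 0.5 * np.sum((h_T - target) ** 2)

def bptt(W_xh, W_hh, b_h, xs, target, k=None):
    """Gradients of loss_fn via BPTT; k limits how many steps are unrolled backward."""
    hs = forward(W_xh, W_hh, b_h, xs)
    T = len(xs)
    steps = T if k is None else min(k, T)
    dW_xh = np.zeros_like(W_xh)
    dW_hh = np.zeros_like(W_hh)
    db_h = np.zeros_like(b_h)
    dh = hs[-1] - target                  # dL/dh_T
    for t in range(T, T - steps, -1):     # t = T, T-1, ..., T-steps+1
        da = dh * (1.0 - hs[t] ** 2)      # back through tanh
        dW_xh += np.outer(da, xs[t - 1])  # shared weights: contributions are summed
        dW_hh += np.outer(da, hs[t - 1])
        db_h += da
        dh = W_hh.T @ da                  # pass the gradient on to h_{t-1}
    return dW_xh, dW_hh, db_h

rng = np.random.default_rng(0)
W_xh = rng.normal(0.0, 0.5, (4, 3))
W_hh = rng.normal(0.0, 0.5, (4, 4))
b_h = np.zeros(4)
xs = rng.normal(size=(6, 3))
target = rng.normal(size=4)

full = bptt(W_xh, W_hh, b_h, xs, target)            # unroll all 6 steps
truncated = bptt(W_xh, W_hh, b_h, xs, target, k=2)  # unroll only the last 2 steps
```

With `k=None` the result is the exact gradient and can be checked against finite differences; with a small `k` it is an approximation that ignores how earlier steps influenced the final state.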

Strengths & limitations

Strengths
✓ Natural handling of variable-length sequences
✓ Linear complexity in sequence length (vs. quadratic in Transformers)
✓ Low memory footprint in online inference
✓ Few parameters thanks to weight sharing
✓ Streaming-friendly: one token at a time, no context buffer
✓ Well suited to edge devices and low-latency tasks
Limitations
✗ Vanishing-gradient problem: hard to learn long-term dependencies
✗ Exploding-gradient problem: requires clipping
✗ Sequential training: hard to parallelize across sequence length
✗ Short effective memory (a few dozen steps) in the vanilla variant
✗ Poor scaling to very large models vs. the Transformer
✗ Weaker than Transformers on long text contexts
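The streaming strength noted above comes from the fact that the only thing kept between steps is the hidden state. A minimal sketch, assuming a hypothetical `StreamingRNN` helper class and arbitrary small dimensions:

```python
import numpy as np

class StreamingRNN:
    """Keeps only the current hidden state between calls (hypothetical helper)."""

    def __init__(self, W_xh, W_hh, b_h):
        self.W_xh, self.W_hh, self.b_h = W_xh, W_hh, b_h
        self.h = np.zeros(W_hh.shape[0])

    def step(self, x_t):
        """Consume one input vector and return the updated hidden state."""
        self.h = np.tanh(self.W_xh @ x_t + self.W_hh @ self.h + self.b_h)
        return self.h

    def reset(self):
        """Clear the memory, e.g. at the start of a new sequence."""
        self.h = np.zeros_like(self.h)

rng = np.random.default_rng(0)
cell = StreamingRNN(rng.normal(0.0, 0.1, (4, 3)),
                    rng.normal(0.0, 0.1, (4, 4)),
                    np.zeros(4))
inputs = rng.normal(size=(1000, 3))
for x_t in inputs:  # inputs could equally arrive one at a time from a sensor or socket
    h = cell.step(x_t)
```

After 1000 steps the object still holds a single vector of size 4: past inputs are not buffered, which is also why information from far back can be lost.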

Implementation

Implementation pitfalls
Vanishing and exploding gradients · Medium

During BPTT, gradients are multiplied by the recurrent weight matrix (and the activation derivative) at each step, so over long sequences they vanish or explode. Gradient clipping mitigates explosion, but vanishing gradients typically call for a gated architecture such as LSTM or GRU.
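The effect can be made concrete with a small numerical sketch. It rests on simplifying assumptions: the hidden states are held at zero so that the tanh derivative equals 1 (the linear regime), and W_hh is a scaled orthogonal matrix so that every backward step changes the gradient norm by exactly the scale factor. The function names `backward_gradient_norms` and `clip_by_global_norm` are hypothetical.

```python
import numpy as np

def backward_gradient_norms(W_hh, hs):
    """Norm of dL/dh_t as a unit gradient at the last step is pushed back in time.

    hs holds the hidden states h_1..h_T; each backward step applies
    W_hh^T @ diag(1 - h_t^2), the transposed Jacobian of the tanh update.
    """
    d = W_hh.shape[0]
    g = np.ones(d) / np.sqrt(d)  # unit-norm starting gradient
    norms = [np.linalg.norm(g)]
    for h_t in hs[::-1]:
        g = W_hh.T @ (g * (1.0 - h_t ** 2))
        norms.append(np.linalg.norm(g))
    return norms

def clip_by_global_norm(grads, max_norm):
    """Rescale all gradient arrays together so their joint L2 norm is at most max_norm."""
    total = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    scale = min(1.0, max_norm / (total + 1e-12))
    return [g * scale for g in grads], total

rng = np.random.default_rng(0)
Q, _ = np.linalg.qr(rng.normal(size=(8, 8)))  # orthogonal: all singular values are 1
hs = np.zeros((50, 8))                        # assumed linear regime: tanh'(0) = 1

vanishing = backward_gradient_norms(0.9 * Q, hs)
exploding = backward_gradient_norms(1.1 * Q, hs)
print(vanishing[-1], exploding[-1])  # approximately 0.9**50 and 1.1**50

# Clipping bounds the size of an update but does nothing for a vanished gradient.
clipped, norm_before = clip_by_global_norm(
    [np.full(4, 10.0), np.full(4, -10.0)], max_norm=1.0
)
```

In a real tanh network the derivative factor is at most 1, so it can only shrink the gradient further; the exploding case here is an illustration of the linear regime, not a prediction for a saturating network.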

Sequential processing prevents full parallelism · Medium

An RNN must finish step t before it can start step t+1, because h_{t+1} depends on h_t, so the computation cannot be parallelized along the time axis. For long sequences this is the main training-speed bottleneck.

Evolution

Original paper · 1990 · Jeffrey L. Elman
Finding Structure in Time
1982
John Hopfield introduces Hopfield networks, early recurrent neural networks with a dynamic state.
1986
Rumelhart, Hinton and Williams describe backpropagation through time (BPTT) for recurrent networks.
1990
Elman publishes "Finding Structure in Time", the first full description of the Simple Recurrent Network (SRN).
1994
Bengio, Simard and Frasconi formalize the vanishing-gradient problem in RNNs.
1997
Hochreiter and Schmidhuber publish LSTM, a gated variant that addresses the vanishing-gradient problem.
2014
Sutskever, Vinyals and Le introduce RNN/LSTM-based seq2seq, a breakthrough in machine translation.
2017
Vaswani et al. publish the Transformer ("Attention Is All You Need"), which progressively replaces RNNs in NLP.

Computational complexity

Computational characteristics
→ Time complexity: O(T · d²), where T is the sequence length and d the state dimension
→ Training memory: O(T · d) for BPTT (state storage)
→ Poor parallelism across sequence length: strongly sequential
→ Good parallelism across the batch
→ Streaming inference: O(1) memory per token, constant per-step cost
→ Few parameters compared with a Transformer of similar capacity
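The scaling behavior listed above can be checked with simple arithmetic. The sketch below counts parameters and multiply-accumulate operations for the single-layer network with matrices W_xh, W_hh and W_hy; the function names and the example sizes (128-dimensional input and output, 256-dimensional state, 100 steps) are illustrative assumptions.

```python
def rnn_param_count(input_dim, hidden_dim, output_dim):
    """Parameters of W_xh, W_hh, b_h, W_hy, b_y; independent of sequence length."""
    return (hidden_dim * input_dim + hidden_dim * hidden_dim + hidden_dim
            + output_dim * hidden_dim + output_dim)

def rnn_macs(seq_len, input_dim, hidden_dim, output_dim):
    """Multiply-accumulates for one forward pass over seq_len steps."""
    per_step = (hidden_dim * input_dim + hidden_dim * hidden_dim
                + output_dim * hidden_dim)
    return seq_len * per_step

def bptt_state_floats(seq_len, hidden_dim):
    """Hidden-state values stored for full BPTT: one vector per step, O(T * d)."""
    return seq_len * hidden_dim

print(rnn_param_count(128, 256, 128))  # 131456
print(rnn_macs(100, 128, 256, 128))    # 13107200
print(rnn_macs(200, 128, 256, 128))    # 26214400, i.e. linear in T
print(bptt_state_floats(100, 256))     # 25600
```

Doubling the sequence length doubles the forward cost and the BPTT state storage but leaves the parameter count unchanged, while the hidden_dim * hidden_dim term is the source of the d² factor.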
Benchmark notes

On Penn Treebank, vanilla RNNs reached roughly 120 perplexity and LSTMs roughly 80, illustrating the advantage of gated variants. On WMT 2014 machine translation, RNN/LSTM seq2seq models reached roughly 30 BLEU, trailing Transformers by a few points. RNNs remain a standard choice in on-device speech models and in some time-series forecasting systems.

Hardware requirements

The RNN matrix operations (W_xh, W_hh) are accelerated by cuBLAS. The cuDNN fused kernels for RNN/LSTM/GRU give a 5-10× speedup over a naive PyTorch implementation.