Architecture

GLU

2016

Updated: 4 May 2026

Key innovation

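Replaces the single activated projection in the standard Transformer FFN layer with the element-wise product of two linear projections, one of which passes through a gating activation. With the hidden size reduced to keep the parameter count equal, this improves model quality without enlarging the model.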

How it works

GLU applies two parallel linear projections to the input: one passes through an activation function and acts as the gate, and the other is multiplied element-wise by the gate output. The original GLU uses a sigmoid gate; variants like SwiGLU (used in LLaMA, Gemini) gate with the Swish (SiLU) activation instead.
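The two-path computation described above can be sketched in a few lines of NumPy. This is a minimal illustration, not a reference implementation: the names `glu_ffn`, `w_gate`, `w_up`, and `w_down` are hypothetical, bias terms are omitted as an assumption, and the toy dimensions are arbitrary.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def silu(x):
    # SiLU (Swish with beta = 1): x * sigmoid(x)
    return x * sigmoid(x)

def glu_ffn(x, w_gate, w_up, w_down, activation=sigmoid):
    """Gated feed-forward block: (activation(x @ w_gate) * (x @ w_up)) @ w_down."""
    gate = activation(x @ w_gate)   # gate path, shape (..., d_hidden)
    value = x @ w_up                # linear path, shape (..., d_hidden)
    return (gate * value) @ w_down  # project back to (..., d_model)

# Toy dimensions, chosen only for illustration.
rng = np.random.default_rng(0)
d_model, d_hidden = 8, 16
x = rng.standard_normal((4, d_model))
w_gate = rng.standard_normal((d_model, d_hidden)) * 0.1
w_up = rng.standard_normal((d_model, d_hidden)) * 0.1
w_down = rng.standard_normal((d_hidden, d_model)) * 0.1

y_glu = glu_ffn(x, w_gate, w_up, w_down, activation=sigmoid)  # classic GLU gate
y_swiglu = glu_ffn(x, w_gate, w_up, w_down, activation=silu)  # SwiGLU-style gate
```

Swapping the `activation` argument is the only difference between the classic and SwiGLU-style variants in this sketch; the surrounding projections are unchanged.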

Problem solved

Standard feed-forward layers in transformers apply a fixed element-wise activation (ReLU, GELU) to a single projection. GLU adds a learned, input-dependent gate that controls how much of each hidden unit passes through, improving model quality at a matched parameter budget.

Implementation

Implementation pitfalls

Doubling of projection layer parameters (Medium)

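GLU requires two parallel input projections instead of one, so a gated FFN has three weight matrices where a standard FFN has two. To keep the same parameter budget at constant model size, the FFN hidden size must be reduced to about 2/3 of its original value (for example, from 4·d_model to roughly 8/3·d_model).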
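The parameter budget can be checked by counting weights directly. A standard FFN has two matrices (2·d_model·d_ff weights) and a gated FFN has three (3·d_model·d_hidden weights), so the two match when d_hidden = 2/3·d_ff. The sketch below ignores biases, and the function names and the example sizes (768 and 3072) are illustrative assumptions.

```python
def ffn_params(d_model, d_ff):
    # Standard FFN: up projection + down projection (biases ignored).
    return 2 * d_model * d_ff

def glu_ffn_params(d_model, d_hidden):
    # Gated FFN: gate + up + down projections (biases ignored).
    return 3 * d_model * d_hidden

def matched_glu_hidden(d_ff):
    # Solve 3 * d_model * h == 2 * d_model * d_ff for h, rounding down.
    return (2 * d_ff) // 3

d_model, d_ff = 768, 3072
d_hidden = matched_glu_hidden(d_ff)

print("standard FFN params:", ffn_params(d_model, d_ff))
print("gated FFN hidden size:", d_hidden)
print("gated FFN params:", glu_ffn_params(d_model, d_hidden))
```

When d_ff is not divisible by 3, the rounded hidden size gives a slightly smaller gated layer; implementations often round to a hardware-friendly multiple instead.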

Vanishing gradients with sigmoidal gating (Medium)

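The classic GLU with a sigmoid gate can suppress gradients in deep networks, because the sigmoid saturates for large-magnitude inputs. For this reason the SwiGLU and GeGLU variants (which gate with SiLU and GELU, respectively) are preferred in modern LLMs.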
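One way to see the difference is to compare the derivatives of the two gate activations. The sketch below looks only at the gate function itself, not at the full GLU gradient (which also flows through the linear path), so it is a partial illustration under that assumption. The helper names are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def d_sigmoid(z):
    # d/dz sigmoid(z) = sigmoid(z) * (1 - sigmoid(z)); peaks at 0.25 when z = 0.
    s = sigmoid(z)
    return s * (1.0 - s)

def d_silu(z):
    # d/dz [z * sigmoid(z)] = sigmoid(z) * (1 + z * (1 - sigmoid(z))).
    s = sigmoid(z)
    return s * (1.0 + z * (1.0 - s))

for z in (-6.0, 0.0, 6.0):
    print(f"z={z:+.0f}  d_sigmoid={d_sigmoid(z):.4f}  d_silu={d_silu(z):.4f}")
```

The sigmoid derivative never exceeds 0.25 and shrinks toward zero on both sides, while the SiLU derivative approaches 1 for large positive inputs, so a strongly open SiLU gate still passes gradient through the gate path.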

Sources

Language Modeling with Gated Convolutional Networks (Dauphin et al., 2016)

GLU Variants Improve Transformer (Shazeer, 2020)