GLU (Gated Linear Unit) applies two parallel linear projections to the input: one passes through an activation function and acts as the gate, and the other is multiplied element-wise by the gate output. Variants such as SwiGLU (used in LLaMA and PaLM) replace the sigmoid gate with the Swish (SiLU) activation.
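The two-path structure can be sketched in a few lines of NumPy. This is a minimal illustration rather than any particular model's implementation: the names `glu_ffn`, `w_gate`, `w_up`, `w_down` and the tiny dimensions are assumptions chosen for readability, and biases are omitted.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def silu(x):
    # SiLU (Swish with beta = 1): x * sigmoid(x)
    return x * sigmoid(x)

def glu_ffn(x, w_gate, w_up, w_down, activation=sigmoid):
    """Gated feed-forward block: (act(x @ w_gate) * (x @ w_up)) @ w_down."""
    gate = activation(x @ w_gate)   # gate path: projection followed by activation
    value = x @ w_up                # value path: plain linear projection
    return (gate * value) @ w_down  # element-wise gating, then project back to d_model

rng = np.random.default_rng(0)
d_model, d_hidden = 8, 16                          # illustrative sizes only
x = rng.standard_normal((4, d_model))              # a batch of 4 token vectors
w_gate = rng.standard_normal((d_model, d_hidden)) * 0.1
w_up = rng.standard_normal((d_model, d_hidden)) * 0.1
w_down = rng.standard_normal((d_hidden, d_model)) * 0.1

y_glu = glu_ffn(x, w_gate, w_up, w_down, activation=sigmoid)  # classic GLU (sigmoid gate)
y_swiglu = glu_ffn(x, w_gate, w_up, w_down, activation=silu)  # SwiGLU (SiLU gate)
print(y_glu.shape, y_swiglu.shape)
```

Only the gate activation differs between the two calls; passing a GELU function instead would give GeGLU.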
Standard feed-forward layers in transformers apply a single element-wise activation (ReLU, GELU) between two projections. GLU adds a gating mechanism that selectively passes information, which has been reported to improve model quality at a matched parameter and compute budget.
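GLU requires two parallel input projections instead of one, so the FFN has three weight matrices instead of two. To keep the same parameter budget, the FFN hidden size must be reduced to about 2/3 of its original value (for example, from 4·d_model to roughly 8/3·d_model).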
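Counting weights directly gives the scaling: a standard FFN has two d_model × d_hidden matrices and a gated FFN has three, so equal budgets require a gated hidden size of about 2/3 of the original. The sketch below checks that arithmetic; the helper names, the choice of d_model = 4096, and the rounding to a multiple of 256 are assumptions for illustration, and biases are ignored.

```python
def ffn_params(d_model, d_hidden):
    # Standard FFN: up-projection + down-projection (biases ignored)
    return 2 * d_model * d_hidden

def glu_ffn_params(d_model, d_hidden):
    # Gated FFN: gate projection + up-projection + down-projection (biases ignored)
    return 3 * d_model * d_hidden

def round_up(value, multiple):
    # Round up to a hardware-friendly multiple (an illustrative convention)
    return ((value + multiple - 1) // multiple) * multiple

d_model = 4096
d_ff = 4 * d_model               # conventional FFN hidden size
d_ff_glu = 2 * d_ff // 3         # 2/3 of the original hidden size

baseline = ffn_params(d_model, d_ff)
gated = glu_ffn_params(d_model, d_ff_glu)
print(d_ff, d_ff_glu, round_up(d_ff_glu, 256))
print(f"relative parameter difference: {(baseline - gated) / baseline:.6f}")
```

With these inputs the rounded hidden size comes out to 11008, which is the intermediate size used in the LLaMA 7B configuration.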
The classic GLU with a sigmoid gate saturates for large inputs, which can suppress gradients in deep networks; hence the SwiGLU and GeGLU variants (with SiLU and GELU gates, respectively) are preferred in modern LLMs.
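One way to see the difference is to compare the derivatives of the three gate activations at a few inputs. This is a rough sketch using only the standard library; it looks at the gate activation in isolation, uses the exact (erf-based) GELU, and is not a full analysis of gradient flow through a gated layer.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def d_sigmoid(x):
    # Derivative of sigmoid: s * (1 - s); peaks at 0.25 and vanishes for large |x|
    s = sigmoid(x)
    return s * (1.0 - s)

def d_silu(x):
    # Derivative of SiLU(x) = x * sigmoid(x)
    s = sigmoid(x)
    return s + x * s * (1.0 - s)

def d_gelu(x):
    # Derivative of exact GELU(x) = x * Phi(x): Phi(x) + x * phi(x)
    cdf = 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))
    pdf = math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)
    return cdf + x * pdf

for x in (-6.0, 0.0, 6.0):
    print(f"x={x:+.1f}  sigmoid'={d_sigmoid(x):.4f}  "
          f"silu'={d_silu(x):.4f}  gelu'={d_gelu(x):.4f}")
```

For large positive inputs the sigmoid derivative approaches zero while the SiLU and GELU derivatives approach one; all three vanish for large negative inputs.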
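The projections in GLU/SwiGLU are matrix multiplications, which are accelerated by Tensor Cores on NVIDIA GPUs (A100/H100). The activation and gating steps are element-wise and memory-bound, so fused kernel implementations (e.g. the SwiGLU op in xFormers) reduce memory overhead by avoiding extra reads and writes of intermediate tensors.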