SVD-decomposed weight matrices
Static knowledge backbone of the model
Base LLM weight matrices decomposed via SVD into U·Σ·Vᵀ. Singular values (Σ) are the application point for expert vectors.
Adapting LLMs to unseen tasks at inference time by selectively adjusting only the singular components of weight matrices, without classical fine-tuning.
1) Offline phase: the base LLM's weight matrices are decomposed via SVD; lightweight expert vectors 'Z' are trained via RL — each vector specialises in a task category (e.g., math, coding, reasoning). 2) Inference phase, pass 1 (dispatch): the system analyses the prompt and identifies the task type. 3) Inference phase, pass 2 (execute): expert vectors matching the task are dynamically mixed and applied to the singular values of the weights, yielding a model tailored to the specific prompt — without updating the original weights.
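A minimal numpy sketch of the adaptation step described above: decompose a weight matrix once, then rescale its singular values with a mixture of expert vectors. Function names, the toy matrix, and the mixing weights are illustrative assumptions, not the released implementation.

```python
import numpy as np

def svd_decompose(W):
    """Offline: factor a weight matrix once into U, singular values s, and Vt."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    return U, s, Vt

def adapt_weight(U, s, Vt, expert_zs, alphas):
    """Inference pass 2: mix expert vectors and rescale the singular values.

    expert_zs: list of vectors z (one per expert), same length as s.
    alphas:    mixing weights chosen by the dispatcher in pass 1.
    """
    z = sum(a * z_k for a, z_k in zip(alphas, expert_zs))  # composed expert vector
    return (U * (s * z)) @ Vt                               # W' = U diag(s ⊙ z) Vt

# toy usage: one hypothetical "math" expert slightly amplifies the top singular directions
rng = np.random.default_rng(0)
W = rng.standard_normal((64, 64))
U, s, Vt = svd_decompose(W)
z_math = np.ones_like(s)
z_math[:8] *= 1.1                                # stand-in for a trained expert vector
W_adapted = adapt_weight(U, s, Vt, [z_math], [1.0])
```

The original weights are never overwritten; the adapted matrix is recomputed per prompt from the stored factors.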
Classical fine-tuning and PEFT (LoRA) produce static adapters that cannot handle arbitrary unseen tasks at runtime. Transformer² solves this by dynamically composing expert vectors at inference time.
Dynamic behavioural adaptation
Lightweight task-specialised vectors trained via Reinforcement Learning. They modulate the singular values Σ during inference.
Task routing
Lightweight classifier that analyses the prompt in the first pass and selects the appropriate set of expert vectors.
Partially parallel
The second inference pass depends on the first pass (sequential dispatch → execute), but expert execution itself is fully parallel.
Conditional
Input dependent
Conceptually similar to MoE, but routing operates in SVD space rather than across FFN blocks.
Number of expert vectors
Number of trained Z vectors covering different task categories.
SVD decomposition rank
Number of singular values retained during weight decomposition — trades adaptation capacity vs. cost.
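A small numpy illustration of this trade-off (matrix size and ranks are arbitrary): keeping fewer singular values shrinks the per-layer expert vector and the compute cost, at the price of a coarser reconstruction of the original weights.

```python
import numpy as np

rng = np.random.default_rng(1)
W = rng.standard_normal((256, 256))
U, s, Vt = np.linalg.svd(W, full_matrices=False)

for r in (16, 64, 256):
    W_r = (U[:, :r] * s[:r]) @ Vt[:r]            # keep only the top-r singular values
    rel_err = np.linalg.norm(W - W_r) / np.linalg.norm(W)
    print(f"rank {r:3d}: relative reconstruction error {rel_err:.3f}, "
          f"expert-vector length {r}")
```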
RL objective for expert vectors
Reward function used to train Z vectors (typically task-specific reward).
If the dispatcher misidentifies the task type, it will select the wrong expert vectors and quality degrades significantly.
Train the dispatcher on diverse prompts and fall back to full mode (all experts mixed) when classification confidence is low.
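A minimal sketch of the confidence-gated fallback described above; the classifier interface and the threshold value are assumptions.

```python
import numpy as np

def choose_mixing_weights(expert_scores, threshold=0.6):
    """Dispatch: pick one expert when confident, otherwise blend all experts.

    expert_scores: unnormalised classifier scores, one per expert vector.
    """
    probs = np.exp(expert_scores - expert_scores.max())
    probs /= probs.sum()                          # softmax over experts
    if probs.max() >= threshold:
        alphas = np.zeros_like(probs)
        alphas[probs.argmax()] = 1.0              # confident: hard routing to one expert
    else:
        alphas = probs                            # uncertain: mix all experts
    return alphas

print(choose_mixing_weights(np.array([2.5, 0.1, 0.2])))    # confident -> one-hot
print(choose_mixing_weights(np.array([0.4, 0.3, 0.35])))   # uncertain -> soft blend
```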
RL training of Z vectors can be unstable in the presence of sparse or noisy rewards.
Use reward shaping, KL regularisation against the base policy, and variance-reducing baselines.
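A hedged sketch of one policy-gradient (REINFORCE-style) update for an expert vector, combining the task reward with the variance-reducing baseline and a KL-style penalty against the base policy mentioned above. The function names and toy tensors are illustrative, not the paper's exact recipe.

```python
import torch

def update_expert_vector(z, optimizer, log_probs, ref_log_probs, rewards, kl_coef=0.1):
    """One REINFORCE-style step on an expert vector z.

    log_probs:     log-probabilities of sampled answers under the z-adapted model
                   (must be computed with z in the graph so gradients reach it).
    ref_log_probs: the same answers scored by the unadapted base model (no grad).
    rewards:       task-specific scalar reward per sampled answer.
    """
    baseline = rewards.mean()                       # variance-reducing baseline
    advantage = rewards - baseline
    kl_penalty = (log_probs - ref_log_probs).mean() # single-sample stand-in for KL vs. base policy
    loss = -(advantage * log_probs).mean() + kl_coef * kl_penalty
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# toy usage: a single expert vector for one adapted layer
z = torch.nn.Parameter(torch.ones(64))
opt = torch.optim.Adam([z], lr=2e-3)
log_probs = 0.01 * z.sum() + torch.tensor([-1.2, -0.8, -1.5, -0.9])  # stand-in values
ref_log_probs = torch.tensor([-1.1, -0.9, -1.4, -1.0])
rewards = torch.tensor([1.0, 0.0, 1.0, 0.0])
update_expert_vector(z, opt, log_probs, ref_log_probs, rewards)
```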
GENESIS · Source paper
Transformer-Squared: Self-adaptive LLMs
SVD as a tool for analysing neural network weight matrices — theoretical foundation
LoRA (Hu et al.) — low-rank adaptation as an efficient fine-tuning alternative
breakthrough · Transformer² (Sakana AI) — first self-adaptive LLM method based on SVD and RL-trained expert vectors
Both SVD decomposition and LLM inference rely on dense matrix operations well supported by tensor cores.
Mixture of Experts (MoE) is an architecture in which a model is composed of multiple parallel sub-networks — the experts — along with a gating (routing) network that determines, for each input, which subset of experts to activate and how to combine their outputs. The gating network produces a weighting over experts; in the original soft formulation (Jacobs et al., 1991), all experts are weighted and summed. In the sparse formulation (Shazeer et al., 2017), only the top-k scoring experts are activated, and the remaining experts produce no output and incur no compute cost for that input.

In the context of large language models, MoE is typically applied as a replacement for the feed-forward network (FFN) sub-layer within each Transformer block. Each token is routed to a small number of expert FFNs (commonly top-1 or top-2), with the router being a learned linear projection followed by a softmax. The outputs of the selected experts are weighted by the corresponding router scores and summed.

A central challenge in sparse MoE is load balancing: without explicit regularization, the router tends to collapse onto a small set of preferred experts, leaving others undertrained. This is addressed via auxiliary load-balancing losses added to the training objective, which encourage a roughly uniform distribution of tokens across experts.

Expert parallelism is the standard distributed training and inference strategy for large MoE models: each expert is placed on a separate device, so that the total parameter count scales with the number of devices without increasing per-device memory or per-token FLOPs proportionally. The capacity factor controls the maximum number of tokens each expert can process per batch; tokens that overflow the capacity are either dropped or passed through a residual connection. Tuning the capacity factor is a critical practical consideration.
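A compact PyTorch sketch of the token-level top-k routing described above. Layer sizes, the top-2 choice, and the loop-based expert dispatch are illustrative; the auxiliary load-balancing loss is omitted for brevity.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoE(nn.Module):
    def __init__(self, d_model=64, d_ff=256, n_experts=8, k=2):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts)        # learned linear router
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )
        self.k = k

    def forward(self, x):                                   # x: (tokens, d_model)
        scores = F.softmax(self.router(x), dim=-1)          # router probabilities
        topk_scores, topk_idx = scores.topk(self.k, dim=-1)
        out = torch.zeros_like(x)
        for slot in range(self.k):                          # combine the top-k expert outputs
            for e, expert in enumerate(self.experts):
                mask = topk_idx[:, slot] == e               # tokens routed to expert e in this slot
                if mask.any():
                    out[mask] += topk_scores[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out

moe = TopKMoE()
tokens = torch.randn(10, 64)
print(moe(tokens).shape)    # torch.Size([10, 64])
```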
LoRA (Low-Rank Adaptation of Large Language Models) is a parameter-efficient fine-tuning (PEFT) technique proposed by Hu et al. (2021). Instead of updating all parameters of a pretrained model, LoRA freezes the original weight matrix W₀ and learns the weight change ΔW as a low-rank decomposition ΔW = BA, where B ∈ ℝ^(d×r), A ∈ ℝ^(r×k), and r ≪ min(d, k) is the rank. The adapted weight is W' = W₀ + (α/r)·BA, where α is a scaling hyperparameter. B is initialized to zero and A with random Gaussian values, ensuring the initial adapted output is identical to the pretrained model.

During training, W₀ is frozen and only A and B are updated via gradient descent. After training, BA can be merged into W₀ (W = W₀ + BA), eliminating any inference latency relative to the original model. Trainable parameters per adapted layer are r·(d + k) instead of d·k, a reduction factor of d·k / (r·(d + k)) — 64× for a layer with r=8 and d=k=1024. LoRA was originally applied to query (Wq) and value (Wv) projection matrices in transformer self-attention, though practitioners often apply it to all linear layers for maximum performance.

Key PEFT context: LoRA belongs to the reparametrization-based PEFT category, alongside adapters and prefix tuning. Its main advantage over adapter layers is zero additional inference latency after weight merging. Common variants include QLoRA (4-bit quantized base model with LoRA adapters), AdaLoRA (adaptive rank allocation via SVD), DoRA (weight-decomposed adaptation of direction and magnitude), and rsLoRA (rank-stabilized scaling α/√r).
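A minimal PyTorch sketch of a LoRA-adapted linear layer following the definitions above; layer sizes and hyperparameter values are arbitrary.

```python
import math
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, d_in, d_out, r=8, alpha=16):
        super().__init__()
        self.base = nn.Linear(d_in, d_out, bias=False)
        self.base.weight.requires_grad_(False)              # W0 is frozen
        self.A = nn.Parameter(torch.randn(r, d_in) / math.sqrt(r))  # Gaussian init
        self.B = nn.Parameter(torch.zeros(d_out, r))         # zero init -> ΔW = 0 at start
        self.scale = alpha / r                               # α/r scaling

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)  # W0·x + (α/r)·BA·x

    def merge(self):
        """Fold BA into W0 so inference has no extra latency."""
        with torch.no_grad():
            self.base.weight += self.scale * (self.B @ self.A)

layer = LoRALinear(1024, 1024, r=8)
n_trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(n_trainable)   # 16384 = r·(d_in + d_out), vs. ~1.05M weights in the dense layer
```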
A Large Language Model (LLM) is a class of machine learning models based on the Transformer architecture, trained on large text datasets via autoregressive language modeling (next-token prediction). These models have billions of parameters and can generate coherent text, answer questions, write code, translate languages, and perform many other language-cognitive tasks without task-specific fine-tuning. The term covers models such as GPT, LLaMA, Gemini, Claude, and Mistral. Most modern LLMs are instruction-tuned (SFT + RLHF) after the pre-training phase.
Reinforcement Learning from Human Feedback (RLHF) is a multi-stage training pipeline used to align language models and other AI systems with human preferences and intent. The approach was formally introduced for deep RL in Christiano et al. (2017), and scaled to large language models in Ouyang et al. (2022) (InstructGPT), where it became the primary alignment technique for systems such as ChatGPT, Claude, and Gemini. The standard RLHF pipeline for LLMs consists of three sequential stages:

1. Supervised Fine-Tuning (SFT): A pretrained language model is fine-tuned on a curated dataset of high-quality (prompt, response) pairs produced by human annotators, yielding a base aligned policy π_SFT.
2. Reward Model Training: Human annotators compare pairs of model responses to the same prompt and express preferences (which response is better). These pairwise comparisons are used to train a scalar reward model r_φ(x, y), typically using a Bradley-Terry model as the preference objective: loss = −E[log σ(r(x, y_w) − r(x, y_l))], where y_w is the preferred and y_l the rejected response.
3. RL Fine-Tuning via PPO: The SFT-initialized policy π_θ is optimized with Proximal Policy Optimization (PPO) to maximize the reward from r_φ, subject to a KL divergence penalty that prevents the policy from drifting too far from π_SFT: Objective(x, y) = r_φ(x, y) − β · KL(π_θ(y|x) || π_SFT(y|x)).

The KL penalty with coefficient β is critical to prevent reward hacking. During PPO training, four models are needed simultaneously: the active policy, a frozen reference policy (π_SFT), the reward model, and a value/critic network. This makes RLHF computationally expensive, requiring substantial GPU memory.

A key limitation is reward hacking: since the reward model is a proxy for human preferences trained on finite data, the policy can find ways to exploit its imperfections — generating outputs that score highly on the reward model but are degenerate or low-quality. The KL penalty is the primary mitigation mechanism. Direct Preference Optimization (DPO, Rafailov et al., 2023) was proposed as a mathematically equivalent simplification of RLHF that eliminates the explicit reward model and RL training loop, replacing them with a single supervised loss directly on preference pairs.
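A short PyTorch rendering of the two objectives above. The reward model itself is abstracted away: the functions take precomputed scalar scores and log-probabilities, and the per-sample log-ratio is a common single-sample stand-in for the KL term, not the full PPO machinery.

```python
import torch
import torch.nn.functional as F

def preference_loss(r_chosen, r_rejected):
    """Stage 2, Bradley-Terry objective: -E[log σ(r(x, y_w) - r(x, y_l))].

    r_chosen / r_rejected: reward-model scores for the preferred and rejected
    responses over a batch of prompts, shape (batch,).
    """
    return -F.logsigmoid(r_chosen - r_rejected).mean()

def penalized_reward(r_phi, logp_policy, logp_ref, beta=0.1):
    """Stage 3, per-sample estimate of r_φ(x, y) − β · KL(π_θ || π_SFT)."""
    return r_phi - beta * (logp_policy - logp_ref)

# toy usage
r_w = torch.tensor([1.2, 0.3])
r_l = torch.tensor([0.4, 0.5])
print(preference_loss(r_w, r_l))
```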
| Title | Publisher | Type |
|---|---|---|
| Transformer-Squared: Self-adaptive LLMs (arXiv 2501.06252) | arXiv | scientific article |
| Transformer² — official Sakana AI blog post | Sakana AI | blog |
| SakanaAI/self-adaptive-llms (GitHub) | Sakana AI | code |