SVD-decomposed weight matrices
Static knowledge backbone of the model
Base LLM weight matrices decomposed via SVD into U·Σ·Vᵀ. Singular values (Σ) are the application point for expert vectors.
Adapting LLMs to unseen tasks at inference time by selectively adjusting only the singular components of weight matrices, without classical fine-tuning.
1) Offline phase: the base LLM's weight matrices are decomposed via SVD; lightweight expert vectors 'Z' are trained via RL — each vector specialises in a task category (e.g., math, coding, reasoning). 2) Inference phase, pass 1 (dispatch): the system analyses the prompt and identifies the task type. 3) Inference phase, pass 2 (execute): expert vectors matching the task are dynamically mixed and applied to the singular values of the weights, yielding a model tailored to the specific prompt — without updating the original weights.
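A minimal numpy sketch of the adaptation step described above: decompose a weight matrix once, then rescale its singular values with a mixture of expert vectors. Function names, the toy matrix, and the mixing weights are illustrative assumptions, not the released implementation.

```python
import numpy as np

def svd_decompose(W):
    """Offline: factor a weight matrix once into U, singular values s, and Vt."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    return U, s, Vt

def adapt_weight(U, s, Vt, expert_zs, alphas):
    """Inference pass 2: mix expert vectors and rescale the singular values.

    expert_zs: list of vectors z (one per expert), same length as s.
    alphas:    mixing weights chosen by the dispatcher in pass 1.
    """
    z = sum(a * z_k for a, z_k in zip(alphas, expert_zs))  # composed expert vector
    return (U * (s * z)) @ Vt                               # W' = U diag(s ⊙ z) Vt

# toy usage: one hypothetical "math" expert slightly amplifies the top singular directions
rng = np.random.default_rng(0)
W = rng.standard_normal((64, 64))
U, s, Vt = svd_decompose(W)
z_math = np.ones_like(s)
z_math[:8] *= 1.1                                # stand-in for a trained expert vector
W_adapted = adapt_weight(U, s, Vt, [z_math], [1.0])
```

The original weights are never overwritten; the adapted matrix is recomputed per prompt from the stored factors.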
Classical fine-tuning and PEFT (LoRA) produce static adapters that cannot handle arbitrary unseen tasks at runtime. Transformer² solves this by dynamically composing expert vectors at inference time.
Dynamic behavioural adaptation
Lightweight task-specialised vectors trained via Reinforcement Learning. They modulate the singular values Σ during inference.
Task routing
Lightweight classifier that analyses the prompt in the first pass and selects the appropriate set of expert vectors.
Partially parallel
The second inference pass depends on the first pass (sequential dispatch → execute), but expert execution itself is fully parallel.
Conditional
Input dependent
Conceptually similar to MoE, but routing operates in SVD space rather than across FFN blocks.
Number of expert vectors
Number of trained Z vectors covering different task categories.
SVD decomposition rank
Number of singular values retained during weight decomposition — trades adaptation capacity vs. cost.
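A small numpy illustration of this trade-off (matrix size and ranks are arbitrary): keeping fewer singular values shrinks the per-layer expert vector and the compute cost, at the price of a coarser reconstruction of the original weights.

```python
import numpy as np

rng = np.random.default_rng(1)
W = rng.standard_normal((256, 256))
U, s, Vt = np.linalg.svd(W, full_matrices=False)

for r in (16, 64, 256):
    W_r = (U[:, :r] * s[:r]) @ Vt[:r]            # keep only the top-r singular values
    rel_err = np.linalg.norm(W - W_r) / np.linalg.norm(W)
    print(f"rank {r:3d}: relative reconstruction error {rel_err:.3f}, "
          f"expert-vector length {r}")
```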
RL objective for expert vectors
Reward function used to train Z vectors (typically task-specific reward).
If the dispatcher misidentifies the task type, it will select the wrong expert vectors and quality degrades significantly.
Train the dispatcher on diverse prompts and fall back to full mode (all experts mixed) when classification confidence is low.
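A minimal sketch of the confidence-gated fallback described above; the classifier interface and the threshold value are assumptions.

```python
import numpy as np

def choose_mixing_weights(expert_scores, threshold=0.6):
    """Dispatch: pick one expert when confident, otherwise blend all experts.

    expert_scores: unnormalised classifier scores, one per expert vector.
    """
    probs = np.exp(expert_scores - expert_scores.max())
    probs /= probs.sum()                          # softmax over experts
    if probs.max() >= threshold:
        alphas = np.zeros_like(probs)
        alphas[probs.argmax()] = 1.0              # confident: hard routing to one expert
    else:
        alphas = probs                            # uncertain: mix all experts
    return alphas

print(choose_mixing_weights(np.array([2.5, 0.1, 0.2])))    # confident -> one-hot
print(choose_mixing_weights(np.array([0.4, 0.3, 0.35])))   # uncertain -> soft blend
```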
RL training of Z vectors can be unstable in the presence of sparse or noisy rewards.
Use reward shaping, KL regularisation against the base policy, and variance-reducing baselines.
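A hedged sketch of one policy-gradient (REINFORCE-style) update for an expert vector, combining the task reward with the variance-reducing baseline and a KL-style penalty against the base policy mentioned above. The function names and toy tensors are illustrative, not the paper's exact recipe.

```python
import torch

def update_expert_vector(z, optimizer, log_probs, ref_log_probs, rewards, kl_coef=0.1):
    """One REINFORCE-style step on an expert vector z.

    log_probs:     log-probabilities of sampled answers under the z-adapted model
                   (must be computed with z in the graph so gradients reach it).
    ref_log_probs: the same answers scored by the unadapted base model (no grad).
    rewards:       task-specific scalar reward per sampled answer.
    """
    baseline = rewards.mean()                       # variance-reducing baseline
    advantage = rewards - baseline
    kl_penalty = (log_probs - ref_log_probs).mean() # single-sample stand-in for KL vs. base policy
    loss = -(advantage * log_probs).mean() + kl_coef * kl_penalty
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# toy usage: a single expert vector for one adapted layer
z = torch.nn.Parameter(torch.ones(64))
opt = torch.optim.Adam([z], lr=2e-3)
log_probs = 0.01 * z.sum() + torch.tensor([-1.2, -0.8, -1.5, -0.9])  # stand-in values
ref_log_probs = torch.tensor([-1.1, -0.9, -1.4, -1.0])
rewards = torch.tensor([1.0, 0.0, 1.0, 0.0])
update_expert_vector(z, opt, log_probs, ref_log_probs, rewards)
```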
GENESIS · Source paper
Transformer-Squared: Self-adaptive LLMs
SVD as a tool for analysing neural network weight matrices — theoretical foundation
LoRA (Hu et al.) — low-rank adaptation as an efficient fine-tuning alternative
breakthrough · Transformer² (Sakana AI) — first self-adaptive LLM method based on SVD and RL-trained expert vectors
Both SVD decomposition and LLM inference rely on dense matrix operations well supported by tensor cores.
Mixture of Experts (MoE) is an architecture in which a model is composed of multiple parallel sub-networks — the experts — along with a gating (routing) network that determines, for each input, which subset of experts to activate and how to combine their outputs. The gating network produces a weighting over experts; in the original soft formulation (Jacobs et al., 1991), all experts are weighted and summed. In the sparse formulation (Shazeer et al., 2017), only the top-k scoring experts are activated, and the remaining experts produce no output and incur no compute cost for that input.

In the context of large language models, MoE is typically applied as a replacement for the feed-forward network (FFN) sub-layer within each Transformer block. Each token is routed to a small number of expert FFNs (commonly top-1 or top-2), with the router being a learned linear projection followed by a softmax. The outputs of the selected experts are weighted by the corresponding router scores and summed.

A central challenge in sparse MoE is load balancing: without explicit regularization, the router tends to collapse onto a small set of preferred experts, leaving others undertrained. This is addressed via auxiliary load-balancing losses added to the training objective, which encourage a roughly uniform distribution of tokens across experts.

Expert parallelism is the standard distributed training and inference strategy for large MoE models: each expert is placed on a separate device, so that the total parameter count scales with the number of devices without increasing per-device memory or per-token FLOPs proportionally. The capacity factor controls the maximum number of tokens each expert can process per batch; tokens that overflow the capacity are either dropped or passed through a residual connection. Tuning the capacity factor is a critical practical consideration.
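A compact PyTorch sketch of the token-level top-k routing described above. Layer sizes, the top-2 choice, and the loop-based expert dispatch are illustrative; the auxiliary load-balancing loss is omitted for brevity.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoE(nn.Module):
    def __init__(self, d_model=64, d_ff=256, n_experts=8, k=2):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts)        # learned linear router
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )
        self.k = k

    def forward(self, x):                                   # x: (tokens, d_model)
        scores = F.softmax(self.router(x), dim=-1)          # router probabilities
        topk_scores, topk_idx = scores.topk(self.k, dim=-1)
        out = torch.zeros_like(x)
        for slot in range(self.k):                          # combine the top-k expert outputs
            for e, expert in enumerate(self.experts):
                mask = topk_idx[:, slot] == e               # tokens routed to expert e in this slot
                if mask.any():
                    out[mask] += topk_scores[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out

moe = TopKMoE()
tokens = torch.randn(10, 64)
print(moe(tokens).shape)    # torch.Size([10, 64])
```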
LoRA (Low-Rank Adaptation of Large Language Models) is a parameter-efficient fine-tuning (PEFT) technique proposed by Hu et al. (2021). Instead of updating all parameters of a pretrained model, LoRA freezes the original weight matrix W₀ and learns the weight change ΔW as a low-rank decomposition ΔW = BA, where B ∈ ℝ^(d×r), A ∈ ℝ^(r×k), and r ≪ min(d, k) is the rank. The adapted weight is W' = W₀ + (α/r)·BA, where α is a scaling hyperparameter. B is initialized to zero and A with random Gaussian values, ensuring the initial adapted output is identical to the pretrained model.

During training, W₀ is frozen and only A and B are updated via gradient descent. After training, BA can be merged into W₀ (W = W₀ + BA), eliminating any inference latency relative to the original model. Trainable parameters per adapted layer are r·(d + k) instead of d·k, a reduction factor of d·k / (r·(d + k)) — 64× for a layer with r=8 and d=k=1024. LoRA was originally applied to query (Wq) and value (Wv) projection matrices in transformer self-attention, though practitioners often apply it to all linear layers for maximum performance.

Key PEFT context: LoRA belongs to the reparametrization-based PEFT category, alongside adapters and prefix tuning. Its main advantage over adapter layers is zero additional inference latency after weight merging. Common variants include QLoRA (4-bit quantized base model with LoRA adapters), AdaLoRA (adaptive rank allocation via SVD), DoRA (weight-decomposed adaptation of direction and magnitude), and rsLoRA (rank-stabilized scaling α/√r).
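A minimal PyTorch sketch of a LoRA-adapted linear layer following the definitions above; layer sizes and hyperparameter values are arbitrary.

```python
import math
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, d_in, d_out, r=8, alpha=16):
        super().__init__()
        self.base = nn.Linear(d_in, d_out, bias=False)
        self.base.weight.requires_grad_(False)              # W0 is frozen
        self.A = nn.Parameter(torch.randn(r, d_in) / math.sqrt(r))  # Gaussian init
        self.B = nn.Parameter(torch.zeros(d_out, r))         # zero init -> ΔW = 0 at start
        self.scale = alpha / r                               # α/r scaling

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)  # W0·x + (α/r)·BA·x

    def merge(self):
        """Fold BA into W0 so inference has no extra latency."""
        with torch.no_grad():
            self.base.weight += self.scale * (self.B @ self.A)

layer = LoRALinear(1024, 1024, r=8)
n_trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(n_trainable)   # 16384 = r·(d_in + d_out), vs. ~1.05M weights in the dense layer
```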
A Large Language Model (LLM) is a class of machine learning models based on the Transformer architecture, trained on large text datasets via autoregressive language modeling (next-token prediction). These models have billions of parameters and can generate coherent text, answer questions, write code, translate languages, and perform many other language-cognitive tasks without task-specific fine-tuning. The term covers models such as GPT, LLaMA, Gemini, Claude, and Mistral. Most modern LLMs are instruction-tuned (SFT + RLHF) after the pre-training phase.
Reinforcement Learning from Human Feedback (RLHF) is a multi-stage training pipeline used to align language models and other AI systems with human preferences and intent. The approach was formally introduced for deep RL in Christiano et al. (2017), and scaled to large language models in Ouyang et al. (2022) (InstructGPT), where it became the primary alignment technique for systems such as ChatGPT, Claude, and Gemini. The standard RLHF pipeline for LLMs consists of three sequential stages:

1. Supervised Fine-Tuning (SFT): A pretrained language model is fine-tuned on a curated dataset of high-quality (prompt, response) pairs produced by human annotators, yielding a base aligned policy π_SFT.
2. Reward Model Training: Human annotators compare pairs of model responses to the same prompt and express preferences (which response is better). These pairwise comparisons are used to train a scalar reward model r_φ(x, y), typically using a Bradley-Terry model as the preference objective: loss = −E[log σ(r(x, y_w) − r(x, y_l))], where y_w is the preferred and y_l the rejected response.
3. RL Fine-Tuning via PPO: The SFT-initialized policy π_θ is optimized with Proximal Policy Optimization (PPO) to maximize the reward from r_φ, subject to a KL divergence penalty that prevents the policy from drifting too far from π_SFT: Objective(x, y) = r_φ(x, y) − β · KL(π_θ(y|x) || π_SFT(y|x)).

The KL penalty with coefficient β is critical to prevent reward hacking. During PPO training, four models are needed simultaneously: the active policy, a frozen reference policy (π_SFT), the reward model, and a value/critic network. This makes RLHF computationally expensive, requiring substantial GPU memory.

A key limitation is reward hacking: since the reward model is a proxy for human preferences trained on finite data, the policy can find ways to exploit its imperfections — generating outputs that score highly on the reward model but are degenerate or low-quality. The KL penalty is the primary mitigation mechanism. Direct Preference Optimization (DPO, Rafailov et al., 2023) was proposed as a mathematically equivalent simplification of RLHF that eliminates the explicit reward model and RL training loop, replacing them with a single supervised loss directly on preference pairs.
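A short PyTorch rendering of the two objectives above. The reward model itself is abstracted away: the functions take precomputed scalar scores and log-probabilities, and the per-sample log-ratio is a common single-sample stand-in for the KL term, not the full PPO machinery.

```python
import torch
import torch.nn.functional as F

def preference_loss(r_chosen, r_rejected):
    """Stage 2, Bradley-Terry objective: -E[log σ(r(x, y_w) - r(x, y_l))].

    r_chosen / r_rejected: reward-model scores for the preferred and rejected
    responses over a batch of prompts, shape (batch,).
    """
    return -F.logsigmoid(r_chosen - r_rejected).mean()

def penalized_reward(r_phi, logp_policy, logp_ref, beta=0.1):
    """Stage 3, per-sample estimate of r_φ(x, y) − β · KL(π_θ || π_SFT)."""
    return r_phi - beta * (logp_policy - logp_ref)

# toy usage
r_w = torch.tensor([1.2, 0.3])
r_l = torch.tensor([0.4, 0.5])
print(preference_loss(r_w, r_l))
```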
| Title | Publisher | Type |
|---|---|---|
| Transformer-Squared: Self-adaptive LLMs (arXiv 2501.06252) | arXiv | scientific article |
| Transformer² — official Sakana AI blog post | Sakana AI | blog |
| SakanaAI/self-adaptive-llms (GitHub) | Sakana AI | code |