What are weights?
When we say a language model "has 70 billion parameters," we are talking primarily about its weights. A weight is a single number describing the strength of the connection between two artificial neurons. An artificial neuron is the smallest unit of a neural network: it takes numbers as input, multiplies them by weights, sums them, and passes the result through an activation function to produce one output number. It is a simplified mathematical model of a brain neuron. Every input to a neuron is multiplied by its corresponding weight: the larger the weight, the greater the influence of that signal on the result.
A weight can be positive or negative. A positive weight amplifies the signal and excites the neuron. A negative weight does the opposite — it dampens or even flips the signal, inhibiting the neuron. So the sign says "which way", and the magnitude — how far from zero, regardless of sign — says "how strongly".
Alongside weights, a network has two more parts worth knowing:
- Bias — a fixed number the neuron adds to the sum of its incoming signals. It is the neuron's starting point: it shifts the whole result up or down before the neuron decides what to pass on. This gives each neuron its own leaning — one reacts to even a faint signal, another stays reserved.
- Activation function — the rule that turns the number a neuron computed into its final output. It does not pass it through one-to-one — it bends it: weak or negative values are suppressed, strong ones pass on. That bending (non-linearity) is the key — without it, even the deepest network would behave like one plain multiplication and could not capture complex patterns.
Together, weights and biases form the full set of a model's parameters.
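The pieces described above (weights, bias, activation function) can be put together in a few lines. This is a minimal sketch, not a production implementation: the function names `relu` and `neuron` and all the numbers are made up for illustration, and ReLU is assumed as the activation.

```python
def relu(z: float) -> float:
    """ReLU activation: suppress negative values, pass positive ones unchanged."""
    return max(0.0, z)

def neuron(inputs: list[float], weights: list[float], bias: float) -> float:
    """Multiply each input by its weight, sum, add the bias, apply the activation."""
    weighted_sum = sum(x * w for x, w in zip(inputs, weights))
    return relu(weighted_sum + bias)

inputs = [1.0, 2.0, 3.0]
weights = [0.5, -1.0, 0.25]   # positive weights excite, the negative one inhibits

print(neuron(inputs, weights, bias=1.0))    # 0.5 - 2.0 + 0.75 + 1.0 = 0.25
print(neuron(inputs, weights, bias=-1.0))   # sum is -1.75, ReLU clips it to 0.0
```

The same inputs and weights give a different output once the bias changes, which is the "leaning" the bias gives each neuron.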
Weights, activations and hyperparameters — what's the difference?
Much of the confusion around weights comes from mixing them up with two other concepts. Weights are not the same as activations or hyperparameters. Here is how the three differ:
- Weights: the model's learned knowledge, i.e. the strength of connections between neurons. They change only during training and then stay fixed. They are what sits in the model file: weights literally are the "model".
- Activations: momentary values that neurons compute for a specific input (either the model's input data, e.g. an encoded piece of text, or the outputs of neurons in the previous layer). They are recreated on every query and stored nowhere; they exist only during computation.
- Hyperparameters: settings of the learning process that a human sets by hand before or during training, such as the learning rate (how big a step the model takes each time it adjusts its weights; too big and training becomes unstable, too small and it learns painfully slowly), the batch size (how many examples the model processes before it updates its weights once; a larger batch means steadier but more memory-hungry steps), and the number of epochs (how many times the model passes through the entire training dataset; too few and it underlearns, too many and it starts memorizing the data). The model never learns them: they steer learning, they are not its result.
Put simply: weights are the result of learning, hyperparameters are the rules of the game, and activations are the model's momentary thoughts while it works.
Where do weights come from?
Weights are not coded by hand. They emerge from training, a cycle of error correction repeated billions of times. At the start, weights are initialized randomly — they cannot be zeros, because then all neurons would behave identically and the network would have nothing to differentiate.
A four-step loop then repeats:
- Forward propagation — data passes through the network, and the current weights compute a prediction.
- Loss function — compares the prediction with the expected result, for example the correct next word in a sentence.
- Backpropagation — the algorithm computes the gradient, indicating how much each weight contributed to the error.
- Optimizer — nudges every weight a small step toward lower error. It is usually an algorithm called Adam (short for Adaptive Moment Estimation) or stochastic gradient descent.
At the heart of this loop is the loss — a single number saying how badly the model got it wrong. It comes from comparing the prediction with the correct answer we know from the training data (for example, the actual next word). The bigger the gap, the bigger the loss. Loss is the compass: all of training is about shrinking it, and the gradient points in which direction to fix the weights.
There is a catch, though. Loss measures the error only on examples whose answer we already know. Driving it to zero can mean not learning but memorizing the data — the model does great on the training set yet stumbles on new examples. This is called overfitting. So low training loss on its own does not guarantee a good model.
A single correction is microscopic. Only their accumulation across billions of iterations on huge GPU clusters turns random noise into matrices that encode grammar, facts, and reasoning ability. The weights of a finished model are a compressed trace of everything the network "read" in its training data.
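The four-step loop can be seen at toy scale. The sketch below is an assumption-laden illustration: it fits a two-parameter model `w * x + b` to invented data, writes the gradient out by hand instead of running real backpropagation through layers, and uses plain gradient descent rather than Adam.

```python
import random

random.seed(0)

# Toy training data following the rule y = 3x + 1 (the "correct answers").
data = [(x, 3.0 * x + 1.0) for x in [-2.0, -1.0, 0.0, 1.0, 2.0]]

w = random.uniform(-0.1, 0.1)   # small random starting weight
b = 0.0                         # bias
learning_rate = 0.05            # hyperparameter, chosen by hand

for step in range(500):
    loss = grad_w = grad_b = 0.0
    for x, target in data:
        prediction = w * x + b                 # 1. forward propagation
        error = prediction - target
        loss += error ** 2 / len(data)         # 2. loss: mean squared error
        grad_w += 2 * error * x / len(data)    # 3. gradient with respect to w
        grad_b += 2 * error / len(data)        #    and with respect to b
    w -= learning_rate * grad_w                # 4. optimizer: one small step
    b -= learning_rate * grad_b                #    toward lower loss

print(round(w, 3), round(b, 3))   # 3.0 1.0
```

Each individual update moves `w` and `b` only slightly; it is the repetition that carries them from a random start to the values hidden in the data.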
Why can't the initial weights be zero?
We mentioned that weights start from small random numbers, not from zero. It sounds like a technical detail, but it is a hard requirement — without it the network does not learn at all. Let us see why, step by step.
Every neuron would compute the same thing
Recall the formula for a single neuron: it takes the numbers coming in, multiplies each by its weight, sums them, adds a bias, and passes the result through an activation function:

$$y_j = f\left(\sum_i w_{ij}\, x_i + b_j\right)$$

| Symbol | Meaning |
|---|---|
| $x_i$ | the numbers entering the neuron |
| $w_{ij}$ | the weight linking input $i$ to neuron $j$ |
| $b_j$ | the bias, a fixed offset |
| $f$ | the activation function, e.g. ReLU (zeroes out negative numbers and passes positive ones unchanged) |
| $y_j$ | the neuron's finished signal |
| $j$ | the index numbering the neurons in a layer |
If all weights are zero, the sum inside the brackets vanishes and every neuron, regardless of its index $j$, produces exactly the same value $f(b_j)$, which is identical across neurons when the biases also start out equal. A hundred neurons behave like one. This is the symmetry problem: an identical start means identical neurons.
Randomness breaks the symmetry
The cure is simple: weights are initialized with small random numbers. Then the neurons compute different values from the very first step, the gradients are non-zero, and each neuron learns something different. For example, with weights … and … the sums … and … differ, so … and the symmetry breaks.
That is why, in practice, clever random-initialization methods like Xavier (Glorot) or He are used — they pick the scale of the random weights so that the signal in a deep network neither vanishes nor explodes. The idea, though, stays the same as in our example: the weights must start out different so the network has something to learn from.
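The symmetry problem and its cure can be checked directly. This is a minimal sketch under stated assumptions: a single layer of four neurons with three inputs, equal biases, invented input values, and a helper `he_init` that implements the He-style rule (zero-mean Gaussian with standard deviation sqrt(2 / n_in)). The helper names are hypothetical.

```python
import math
import random

def pre_activations(inputs, weights, biases):
    """Weighted sum plus bias for each neuron j; weights[j][i] links input i to neuron j."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]

def relu(values):
    return [max(0.0, v) for v in values]

def he_init(n_in, n_out, rng):
    """He-style init: zero-mean Gaussian with standard deviation sqrt(2 / n_in)."""
    std = math.sqrt(2.0 / n_in)
    return [[rng.gauss(0.0, std) for _ in range(n_in)] for _ in range(n_out)]

x = [0.5, -1.0, 2.0]
biases = [0.1] * 4    # equal biases, a common starting choice

zero_z = pre_activations(x, [[0.0] * 3 for _ in range(4)], biases)
rand_z = pre_activations(x, he_init(3, 4, random.Random(42)), biases)

print(relu(zero_z))   # [0.1, 0.1, 0.1, 0.1]: four neurons, one behavior
print(rand_z)         # four different sums, so the neurons can specialize
```

With zero weights every neuron reports the same number whatever the input; with random weights the sums already differ before any training has happened.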
What is a weight physically made of?
Each weight is a floating-point number, and how it is stored in memory has enormous practical consequences. This is where the engineering that drives costs begins. The industry uses three main formats today:
- FP32: the 32-bit standard (4 bytes). It uses 1 bit for the sign (whether the number is positive or negative), 8 bits for the exponent, which sets the range, i.e. how large or small a number can be stored without overflow, and 23 bits for the mantissa, which sets the precision: the more mantissa bits, the smaller the rounding error. Very precise, but memory-hungry.
- FP16: a 16-bit format (2 bytes). It keeps good precision but has a narrow range: it covers only values up to about ±65504, so large numbers do not fit in it. If a value goes beyond that range during training, it turns into "infinity" (Inf), and from that point on the rest of the calculations break and the model stops learning. This is called overflow.
- BF16 (Brain Float 16): also 16-bit, but it splits the bits differently, with 8 bits for the exponent (as many as FP32) and only 7 for the mantissa. So it covers the same huge range as FP32, but rounds numbers more aggressively.
Neural networks tolerate this small rounding noise well, which is why BF16 has become the default format for training large models.
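The trade-off between range and precision can be probed with Python's standard `struct` module. Treat this as a rough emulation, not how hardware does it: `to_bf16` simply truncates the low 16 bits of the FP32 pattern (real converters usually round to nearest), and `to_fp16` turns Python's `OverflowError` into infinity to mimic what a GPU would produce. All three helper names are made up for this sketch.

```python
import struct

def fp32_fields(x: float) -> tuple[int, int, int]:
    """Split a number stored as FP32 into its (sign, exponent, mantissa) bit fields."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    return bits >> 31, (bits >> 23) & 0xFF, bits & 0x7FFFFF

def to_fp16(x: float) -> float:
    """Round-trip through FP16; values beyond its range become infinity."""
    try:
        return struct.unpack(">e", struct.pack(">e", x))[0]
    except OverflowError:   # struct raises here; hardware would return Inf
        return float("inf") if x > 0 else float("-inf")

def to_bf16(x: float) -> float:
    """Emulate BF16 by keeping only the top 16 bits of the FP32 pattern."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    return struct.unpack(">f", struct.pack(">I", bits & 0xFFFF0000))[0]

print(fp32_fields(1.0))   # (0, 127, 0)
print(to_fp16(65504.0))   # 65504.0, the largest finite FP16 value
print(to_fp16(70000.0))   # inf: the value overflows FP16
print(to_bf16(70000.0))   # 69632.0: fits the range, but coarsely rounded
```

The last two lines show the contrast in the text: FP16 loses the value entirely, while BF16 keeps it at the cost of a visible rounding error.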
How much do weights weigh?
Since every weight is a number of a given precision, the size of an entire model can be estimated by simple multiplication: parameter count times bytes per parameter. Hence a practical rule: in a 16-bit format one billion parameters takes about 2 GB, and in 8-bit about 1 GB.
| Precision | Bytes/parameter | 7B model | 70B model |
|---|---|---|---|
| FP32 | 4 B | ~28 GB | ~280 GB |
| FP16 / BF16 | 2 B | ~14 GB | ~140 GB |
| INT8 | 1 B | ~7 GB | ~70 GB |
| INT4 | ~0.5 B | ~3.5–4 GB | ~35–40 GB |
These figures, however, cover the weights at rest alone. In practice, inference (using a finished model to produce answers, with no further learning) needs extra memory for the context buffer (the KV cache), a cache of the already-processed part of the text that grows with the context length, and full training consumes many times more. Alongside the weights, training must also keep gradients in memory (hints telling it how much to adjust each weight) and optimizer states, extra helper numbers that smooth out those adjustments. That is why the same model in training can demand several times more VRAM (the memory of the graphics card, where the weights and all data needed for computation must fit) than it needs just to run.
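The multiplication behind the table is short enough to write down. A small sketch, assuming decimal gigabytes (1 GB = 10^9 bytes) and counting the weights only, with no KV cache, gradients, or optimizer states; the names `BYTES_PER_PARAM` and `weights_gb` are invented for the example.

```python
BYTES_PER_PARAM = {"FP32": 4.0, "FP16": 2.0, "BF16": 2.0, "INT8": 1.0, "INT4": 0.5}

def weights_gb(n_params: float, precision: str) -> float:
    """Size of the weights alone, in decimal gigabytes (1 GB = 1e9 bytes)."""
    return n_params * BYTES_PER_PARAM[precision] / 1e9

for precision in ("FP32", "BF16", "INT8", "INT4"):
    print(precision, weights_gb(7e9, precision), weights_gb(70e9, precision))
# FP32 28.0 280.0
# BF16 14.0 140.0
# INT8 7.0 70.0
# INT4 3.5 35.0
```

Real INT4 files tend to land slightly above the 0.5 bytes per parameter used here, because quantized formats also store scaling factors, which is why the table gives a range.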
Do all weights work at once? Sparsity and MoE
So far we have assumed that every weight takes part in every computation. That is how a dense model works — the default architecture, where the full set of parameters fires for every token.
One approach breaks that rule: Mixture of Experts (MoE). The network is split into many "experts" — disjoint subsets of weights — and a small router decides which few to use for a given token. The rest stay idle. This is a form of sparsity: for a specific input, only a fraction of all weights are active.
Total versus active parameters
In a sparse model such as MoE, two numbers that meant the same thing in a regular dense model now drift apart:
- Total parameters: all of the model's weights. This is what must be loaded into memory.
- Active parameters: the ones that actually compute for a single token. This is what sets the cost and speed of inference.
In practice the gap can be huge:
- Mixtral 8×7B — about 47 billion weights in total, but only around 13 billion active per token.
- DeepSeek-V3 — 671 billion total, around 37 billion active.
For memory usage this means something important: memory always holds all the weights, because the whole model must be loaded, so an MoE model "weighs" as much as its total parameter count. But it only computes with its active slice, so it runs as fast as a model of that size — its speed is set by the number of active parameters, not the total. That gives cheaper, faster per-token inference. That is why "70 billion parameters" can mean two very different things depending on whether the model is dense or sparse.
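What the router does can be sketched as a top-k selection. The example below is illustrative only: it assumes 8 experts with 2 chosen per token, invented router scores, and a softmax over the chosen scores as the mixing rule; actual models differ in these details, and `top_k_route` is a hypothetical name.

```python
import math

def top_k_route(router_logits: list[float], k: int) -> dict[int, float]:
    """Pick the k highest-scoring experts and softmax-normalize their scores."""
    ranked = sorted(range(len(router_logits)),
                    key=lambda i: router_logits[i], reverse=True)
    chosen = ranked[:k]
    exps = {i: math.exp(router_logits[i]) for i in chosen}
    total = sum(exps.values())
    return {i: e / total for i, e in exps.items()}

# One router score per expert for a single token (made-up values).
logits = [0.1, 2.0, -1.0, 0.3, 1.5, 0.0, -0.5, 1.0]
routing = top_k_route(logits, k=2)

print(sorted(routing))   # [1, 4]: only these two experts compute for this token
print(routing)           # their mixing weights, which sum to 1
```

Only the weights of the selected experts take part in the computation for this token, although all eight experts must still sit in memory. With the figures quoted above, Mixtral 8×7B computes with roughly 13 of its 47 billion weights per token, about 28 percent.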
Who builds the ecosystem around weights?
Although weights are an abstract mathematical concept, the entire ecosystem of tools for storing and adapting them is built by specific companies and teams. The Safetensors format, today's industrial standard for safely storing raw weights, was developed by Hugging Face. It emerged as a response to a real threat — PyTorch's older format, based on the pickle library, allowed arbitrary code to be hidden in a model file and executed on load. Safetensors stores only numbers, with no way to inject logic.
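The risk with pickle-based files is easy to reproduce with the standard library alone. The sketch below uses a deliberately harmless stand-in: the class `Payload` is hypothetical, and it asks pickle to call `print` on load, where an attacker could name any other callable.

```python
import pickle

class Payload:
    """Stand-in for a malicious object hidden inside a pickled model file."""
    def __reduce__(self):
        # Tells pickle: "to rebuild me, call this function with these arguments".
        # print is harmless; the mechanism accepts any importable callable.
        return (print, ("this code ran while the file was loading",))

blob = pickle.dumps(Payload())   # what would be written into the model file
loaded = pickle.loads(blob)      # the call to print happens here, during load

print(loaded)   # None: the "model" is just whatever the call returned
```

Nothing in the loading code asked for a function to run, yet one did. A format that stores only tensor metadata and raw numbers, as Safetensors does, leaves no place for such an instruction.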
On the local-inference side, the llama.cpp project by Georgi Gerganov played a key role, together with the GGUF format, which packs quantized weights and configuration into a single self-contained file. The LoRA technique, discussed shortly, was developed by researchers at Microsoft. Finally, the open-weights philosophy is driven today by companies such as Meta (LLaMA models), Mistral AI, DeepSeek, and Alibaba (Qwen models), which publish their parameters openly.
How are weights squeezed onto a laptop?
The most important technique of recent years is quantization — mapping high-precision weights onto low-bit integer formats:
- INT8 — a weight stored in 8 bits. A practical rule of thumb: in INT8, one billion parameters is roughly one gigabyte of memory.
- INT4 — a weight in just four bits, i.e. four times less space than FP16.
The savings are dramatic. As the table above shows, the same 70-billion-parameter model drops from about 280 GB in FP32 to just 35–40 GB in INT4. It is thanks to 4-bit quantization that a model like LLaMA 3 70B can run on a laptop with 64 GB of shared memory. Quantizing to 4 bits is often reported to preserve around 95 percent of a model's original capability while shrinking its size fourfold relative to FP16 (eightfold relative to FP32), though the exact loss depends on the model, the quantization method, and the task.
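The core idea of quantization fits in a few lines. This is a simplified sketch of one scheme, symmetric "absmax" INT8 quantization with a single scale for the whole list; practical quantizers typically work per block or per channel and use more elaborate rounding. The weights and function names are invented, and the code assumes at least one non-zero weight.

```python
def quantize_int8(weights: list[float]) -> tuple[list[int], float]:
    """Symmetric absmax quantization: map [-max|w|, +max|w|] onto [-127, 127]."""
    scale = max(abs(w) for w in weights) / 127.0   # assumes a non-zero maximum
    return [round(w / scale) for w in weights], scale

def dequantize(q: list[int], scale: float) -> list[float]:
    """Recover approximate floating-point weights from the stored integers."""
    return [v * scale for v in q]

weights = [0.42, -1.27, 0.003, 0.9, -0.61]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)

print(q)          # [42, -127, 0, 90, -61]
print(restored)   # close to the originals, but 0.003 has collapsed to 0.0
```

Each weight now needs one byte plus a shared scale, and the price is a rounding error of at most half a quantization step per weight.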
Why change weights after training — LoRA and fine-tuning
A finished base model has general knowledge, but it often needs to be tuned to a specific task, tone, or domain. Classic full fine-tuning updates every one of the billions of weights, which is so costly that only large centers with accelerator clusters can afford it.
LoRA — the cheaper path
The breakthrough was LoRA (Low-Rank Adaptation). Its authors noticed that the weight changes needed to adapt a model can be approximated by the product of two much smaller matrices, instead of changing all the weights directly. LoRA starts from an already trained model that holds all the learned knowledge. It freezes that model's weights and, instead of retraining the whole thing, trains only small matrices attached next to selected layers of the model. When the model runs, their output is added to the output of the frozen layer: the original weights stay untouched, yet the model's behavior changes. The result is not a separate model, just a small "overlay" on the existing one. That brings three concrete benefits:
- Fewer trained parameters: their number can drop by up to roughly 10,000 times, the figure the LoRA authors report for GPT-3 175B.
- Cheaper hardware: a model can be tuned on a single consumer GPU in a few hours.
- Modularity: one frozen base model works with many small adapters, switched depending on the task.
An extension of this idea is QLoRA, which combines LoRA with 4-bit quantization of the base model, allowing truly large models to be tuned on hardware that previously could not even hold them.
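The "overlay" can be written out for a single layer. A minimal sketch with made-up 2×2 numbers: `W` is the frozen weight matrix, `A` and `B` are the two small trainable matrices, and `B` starts at zero so the adapter initially changes nothing, as in the original LoRA setup. The helper names and the 4096×4096, rank-8 sizes are assumptions chosen for illustration.

```python
def matvec(m, v):
    """Matrix-vector product for nested lists."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def lora_forward(W, A, B, x, scale=1.0):
    """Frozen layer output W·x plus the low-rank update B·(A·x)."""
    base = matvec(W, x)
    delta = matvec(B, matvec(A, x))
    return [b + scale * d for b, d in zip(base, delta)]

def lora_param_counts(d_out, d_in, rank):
    """(weights in the full matrix, weights in the two adapter matrices)."""
    return d_out * d_in, rank * (d_in + d_out)

W = [[1.0, 0.0], [0.0, 1.0]]     # frozen base weights, never updated
A = [[1.0, 1.0]]                 # shape (rank, d_in), here rank = 1
B_init = [[0.0], [0.0]]          # shape (d_out, rank), zero at the start
B_trained = [[0.5], [-1.0]]      # pretend values after fine-tuning
x = [2.0, 3.0]

print(lora_forward(W, A, B_init, x))       # [2.0, 3.0]: same as the base model
print(lora_forward(W, A, B_trained, x))    # [4.5, -2.0]: behavior has changed
print(lora_param_counts(4096, 4096, 8))    # (16777216, 65536)
```

For this one layer the adapter holds 256 times fewer numbers than the full matrix. The overall reduction is larger still when adapters are attached to only some of the model's layers.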
Open weights versus closed models
How weights are distributed splits today's market into two camps.
Closed (proprietary) models
Models such as OpenAI's GPT, Anthropic's Claude, or Google's Gemini expose only an API — the weights themselves remain a company secret. The user sends data to someone else's servers and receives a result, changing the model's behavior mainly through prompts.
Open-weights models
Open-weights models, such as LLaMA, Mistral, DeepSeek, and Qwen, publish their finished parameters under licenses that allow downloading and running them locally; some are fully permissive (for example Apache 2.0), while others add usage restrictions. It is worth stressing, however, that "open weights" is not the same as "open source": full open source would also require releasing the training data and code, which most open models do not do.
| Criterion | Closed models (API) | Open weights |
|---|---|---|
| Access to weights | no | yes |
| Where data lives | on someone else's servers | locally, privately |
| Tuning | mostly prompts | full (LoRA, fine-tuning) |
| Vendor lock-in | high | low |
Why does it matter?
Weights have stopped being an internal technical detail and become a central object of strategy across the AI industry. Once you understand that all of a model's "intelligence" fits in a file of numbers, you see every decision around it differently. It comes down to three things:
- Memory and cost — the question "how many parameters does it have" translates directly into how much memory and power a deployment will consume.
- Where you can run it — the choice of precision and quantization decides whether a model fits on an edge device or demands a server room.
- Control — whether you have access to the weights or only to an API marks the line between renting someone else's intelligence and owning your own.
Open weights, quantization, and techniques like LoRA together democratize AI — shifting the ability to build and tune models from a handful of giants into the hands of individual teams and developers. Data from 2024–2026 shows the quality gap between the best closed and open models shrinking fast, with open models already winning on some tasks.
Weights are not magic, just the compressed memory of learning written in numbers. The better you understand how they are formed, stored, and modified, the more deliberately you can choose your AI tools — instead of treating them as an impenetrable black box.
