What are pre-training and fine-tuning?
Pre-training is the phase in which a fresh neural network, almost always a Transformer, starts with random weights and is trained on a massive raw corpus: Common Crawl, Wikipedia, books, code. Its job is not to solve a specific task. It learns the probability distribution of which token follows which. The output is a base model: enormously erudite but useless as an assistant. Asked "How do I make pancakes?" it might respond "How do I make waffles?" because, statistically, a question on the internet is often followed by more questions rather than by an answer.
Fine-tuning takes that base model and reshapes its behaviour. It teaches the model to follow instructions, format output, hold a tone, refuse harmful requests. The scale is completely different: tens to hundreds of thousands of carefully curated examples instead of trillions of tokens; single GPUs running for hours instead of thousands of GPUs running for months.
The classic two-step "pre-training → fine-tuning" picture has fallen apart in the last two years. Frontier labs now talk about four phases: pre-training, mid-training, SFT (supervised fine-tuning), and alignment. Each has a different goal, different data, and a different budget.
Who is behind it?
Pre-training is the gateway to the AI premier league and the capital barrier that defines it. Only a handful of labs realistically train models from scratch: OpenAI, Google DeepMind, Anthropic, Meta, Mistral, DeepSeek, xAI, Qwen (Alibaba). Meta's Llama 3.1 405B took roughly 2–3 months on a cluster of up to 16,000 NVIDIA H100 GPUs, with GPU-hour costs estimated at $92–123M. That estimate does not include the price of the hardware itself, which for a cluster of that size reaches hundreds of millions of dollars in capex.
Fine-tuning, by contrast, is democratic. The Hugging Face community runs it. Startups run it. Corporate product teams run it. Independent researchers run it. Thanks to LoRA, a single engineer with a single A100 can adapt a 70B model to their domain. LoRA, popularised by Microsoft in 2021, and the broader PEFT (Parameter-Efficient Fine-Tuning) family, turned fine-tuning from a luxury into a daily tool.
How does it work?
Pre-training
Pre-training uses self-supervised learning — the labels come from the structure of the text itself, not from humans. Two dominant mechanisms exist:
- Next-token prediction (Causal LM) — the model reads text left to right and, over and over, guesses just one thing: the next word, based on what it has read so far. Each time it compares its guess with the true token — the word that actually appears at that spot. Crucially, the correction does not happen after every word: the model first predicts a whole batch of text (thousands of positions at once) with its weights frozen, then all the misses are summed into a single “loss”, and in one move the backpropagation algorithm shifts the parameters so it aims better next time. The next batch starts with the updated weights — and so it goes, round after round. No human writes the labels — the text itself provides them, so training runs over trillions of words, and that is how the model soaks up grammar, facts and style. This mechanism powers the entire GPT, Llama, Mistral, and Claude family.
- Masked Language Modeling (MLM) — it differs from next-token in one thing: the direction it may look. Next-token sees only what came before and guesses the next word. MLM takes a whole, finished sentence and hides (masks) a random word in it — e.g. “The cat drinks ▢ from the bowl” — and the model’s job is to recover what sits under the gap. This time it can read context from both sides at once: what comes before the gap and what comes after (“The cat drinks” and “from the bowl” together clearly point to “milk”). This two-sided view gives a deeper grasp of meaning, but it removes the ability to write text left-to-right — since the model can already see the rest of the sentence, it never learns to predict it. This is the approach of BERT (2018) and its descendants: excellent where whole-sentence understanding matters — classification, search, meaning analysis — and weaker at generating long passages, where next-token rules.
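The two objectives above differ mainly in how a training example is cut out of raw text. The sketch below is a toy illustration in plain Python, not a real training loop: the function names, the `[MASK]` placeholder, and the choice to average (rather than sum) the per-token losses are assumptions made for the example.

```python
import math

MASK = "[MASK]"  # illustrative placeholder for the hidden token


def causal_lm_examples(tokens):
    """Next-token prediction: every prefix is a context, the following token is the label."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]


def mlm_example(tokens, position):
    """Masked LM: hide one token; the model may read both sides of the gap."""
    masked = list(tokens)
    target = masked[position]
    masked[position] = MASK
    return masked, target


def batch_loss(true_token_probs):
    """Average negative log-probability the model gave to the true tokens in a batch."""
    return -sum(math.log(p) for p in true_token_probs) / len(true_token_probs)


sentence = ["The", "cat", "drinks", "milk", "from", "the", "bowl"]

# Causal LM sees only the left context.
for context, target in causal_lm_examples(sentence)[:3]:
    print(context, "->", target)

# MLM sees the whole sentence except the gap.
masked, target = mlm_example(sentence, 3)
print(masked, "->", target)

# One loss value for the whole batch; a single weight update would follow it.
print(round(batch_loss([0.5, 0.25]), 4))
```

In both cases the label is a token that was already in the text, which is why no human annotation is needed.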
Fine-tuning (SFT)
Fine-tuning is mostly supervised learning — learning from ready answers prepared by humans. After pre-training the model knows language and facts, but it behaves like someone who can only continue text, not answer instructions. SFT (Supervised Fine-Tuning) fixes that: we show the model thousands of examples shaped as [instruction] → [target answer] pairs — e.g. “Summarise this text” next to a ready, good summary written by a human. The model learns to reproduce that answer, and doing this across many different tasks (this is instruction tuning) it picks up the general skill of following instructions. The mechanism is the same as in pre-training — the model still predicts the next tokens — only the material changes: instead of random text from the internet, carefully chosen instruction–answer pairs. That is why SFT teaches mainly form and behaviour: how to answer, in what tone, in what format. We are not loading in new facts here — those come from pre-training.
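A minimal sketch of how one instruction–answer pair can be turned into a training example. It assumes a common practice that the text above does not spell out: the loss is computed only on the answer tokens, so prompt positions get a label the loss function skips. The value `-100` is the default `ignore_index` of PyTorch's `CrossEntropyLoss`; the function name and the token IDs are hypothetical.

```python
IGNORE_INDEX = -100  # PyTorch CrossEntropyLoss skips positions with this label by default


def build_sft_example(prompt_ids, answer_ids):
    """Concatenate prompt and answer; mask the prompt so only the answer is learned."""
    input_ids = list(prompt_ids) + list(answer_ids)
    labels = [IGNORE_INDEX] * len(prompt_ids) + list(answer_ids)
    return input_ids, labels


# Hypothetical token IDs standing in for "Summarise this text ..." and its summary.
prompt_ids = [101, 102, 103]
answer_ids = [201, 202]

input_ids, labels = build_sft_example(prompt_ids, answer_ids)
print(input_ids)
print(labels)
```

The objective is still next-token prediction; only the data and the positions that count toward the loss change.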
Alignment
After SFT comes alignment — tuning the model toward what people consider a good answer. SFT taught the model to follow instructions, but not which of several valid answers is the best one: the most helpful, safe, and appropriately toned. That is what alignment handles.
The classic approach is RLHF (Reinforcement Learning from Human Feedback), made famous by OpenAI in InstructGPT and ChatGPT. It works in three steps. First the model generates several answers to the same prompt, and human annotators rank them from best to worst. Then those rankings are used to train a separate model — reward model — an automatic “judge” that learns to score any answer the way a human would. Finally the model itself (the LLM) is tuned with reinforcement learning (the PPO algorithm) so that its answers earn the highest possible scores from that judge.
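The second step, training the "judge", is often framed as a pairwise comparison: the reward model should score the answer humans preferred higher than the one they rejected. The sketch below shows that loss for a single pair, assuming the scores are already computed; the function name and the example scores are illustrative.

```python
import math


def pairwise_reward_loss(score_preferred, score_rejected):
    """-log(sigmoid(score_preferred - score_rejected)), written as log(1 + exp(-margin))."""
    margin = score_preferred - score_rejected
    return math.log1p(math.exp(-margin))


# The judge agrees with the human ranking: small loss.
print(round(pairwise_reward_loss(2.0, 0.0), 4))

# The judge disagrees with the human ranking: large loss.
print(round(pairwise_reward_loss(0.0, 2.0), 4))
```

Minimising this loss over many ranked pairs pushes the reward model's scores toward the human ordering.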
RLHF is effective but expensive and finicky: you must keep several models running at once (the LLM being tuned, the reward model, usually a reference copy too), and the model can start gaming the judge — producing high-scoring answers that aren't actually better. That is reward hacking.
DPO (Direct Preference Optimization), which appeared in 2023, takes a shortcut: it showed that you don’t need the “judge” at all. Instead of building a separate reward model and wrestling with reinforcement learning, DPO takes ready “better answer / worse answer” pairs straight from humans and, in a single step, nudges the model to treat the preferred one as more likely and the rejected one as less likely. The hard RL problem turns into a plain classification task (“pick the better one”). Fewer moving parts — no reward model, no RL loop — means cheaper and more stable, which is why DPO quickly became the default for most open-source projects. Sebastian Raschka laid out the practical differences between PPO and DPO in detail in his 2024 write-ups.
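For one preference pair, the DPO objective can be sketched in a few lines. The inputs are the summed log-probabilities that the model being tuned and a frozen reference copy assign to the preferred and the rejected answer. The function name, the example numbers, and `beta=0.1` are assumptions chosen for illustration.

```python
import math


def dpo_loss(policy_logp_chosen, policy_logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO loss for one (chosen, rejected) pair of answers to the same prompt."""
    # How much more (or less) likely each answer became relative to the reference.
    chosen_shift = policy_logp_chosen - ref_logp_chosen
    rejected_shift = policy_logp_rejected - ref_logp_rejected
    logits = beta * (chosen_shift - rejected_shift)
    # -log(sigmoid(logits)), the same form as a binary classification loss.
    return math.log1p(math.exp(-logits))


# Policy identical to the reference: no preference learned yet.
print(round(dpo_loss(-10.0, -10.0, -10.0, -10.0), 4))

# Policy raised the chosen answer and lowered the rejected one: lower loss.
print(round(dpo_loss(-8.0, -12.0, -10.0, -10.0), 4))
```

No reward model appears anywhere in the computation, which is the simplification the paragraph above describes; the frozen reference copy is still needed to measure how far the model has moved.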
What are its key components?
A model's lifecycle in 2025 looks roughly like this:
- Pre-training on a huge raw corpus. Goal: language, grammar, world facts. Artefact: base model.
- Mid-training — a relatively new bridging stage. After main pre-training the model is trained further on high-quality data (math, code, "textbook-grade" synthetic data) with a decaying learning rate. The goal is to shift the model from rote memorisation toward abstraction and reasoning. Microsoft's Phi family and reports about Llama 4's training process showed that this stage substantially improves reasoning ability.
- SFT (Supervised Fine-Tuning) — the model learns to follow instructions, respect formats, and obey system roles.
- Alignment — RLHF, DPO, or the newer reasoning-focused RL methods (Reasoning RL, GRPO).
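The "decaying learning rate" mentioned for mid-training can take several shapes. The sketch below assumes the simplest one, a straight line from a peak value down to a final value; the function name, the peak of `3e-4`, and the step counts are illustrative and not taken from any specific lab's recipe.

```python
def linear_decay_lr(step, total_steps, peak_lr=3e-4, final_lr=0.0):
    """Learning rate that falls linearly from peak_lr to final_lr over total_steps."""
    progress = min(max(step / total_steps, 0.0), 1.0)  # clamp to [0, 1]
    return peak_lr + (final_lr - peak_lr) * progress


for step in (0, 250, 500, 750, 1000):
    print(step, linear_decay_lr(step, total_steps=1000))
```

Smaller steps late in the stage mean the curated data refines the model instead of overwriting what pre-training built.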
Sitting alongside the pipeline, LoRA (Low-Rank Adaptation) is the dominant fine-tuning tool. Instead of updating every weight, it freezes the original parameters and inserts small low-rank matrices into selected layers (typically attention). Often less than 2% of parameters are trained, and final quality is frequently comparable to full fine-tuning. The QLoRA variant adds 4-bit quantisation of the frozen base weights, which let its authors fine-tune a 65B model on a single 48 GB GPU.
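The savings follow from simple arithmetic. A full weight matrix of shape `d_out × d_in` has `d_out * d_in` parameters, while a rank-`r` LoRA update adds two thin matrices with `r * (d_out + d_in)` parameters in total. The sketch below computes the ratio for one layer; the 4096 × 4096 shape and rank 8 are illustrative values, not a description of any particular model.

```python
def lora_trainable_fraction(d_out, d_in, rank):
    """Share of a layer's parameters that LoRA trains, relative to the frozen full matrix."""
    full_params = d_out * d_in            # frozen original weight
    lora_params = rank * (d_out + d_in)   # B is d_out x rank, A is rank x d_in
    return lora_params / full_params


fraction = lora_trainable_fraction(d_out=4096, d_in=4096, rank=8)
print(f"{fraction:.4%}")
```

Across a whole model the share depends on how many layers receive adapters and at what rank, which is why reported figures vary.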
What can it be used for?
For a production team the practical decision boils down to three lanes: prompt engineering, RAG, and fine-tuning. Each solves a different problem.
- Prompt engineering wins when the behaviour change fits inside the context window (tens to hundreds of thousands of tokens). It is the cheapest, fastest, most iterative option. Limit: no long-term memory.
- RAG (Retrieval-Augmented Generation) is the default whenever you need fresh, verifiable facts that live outside both the context window and the model's training data. Documents are vectorised and stored in a vector database, the query is vectorised too, the system retrieves the top-K most similar fragments and injects them into the prompt. RAG has become the standard for enterprise chatbots over knowledge bases, technical docs, and policy archives.
- Fine-tuning is the right tool when you need to change how the model behaves — enforcing a strict JSON schema, learning industry jargon, locking down a tone across thousands of generations, matching an editorial voice. Fine-tuning is a poor tool for injecting new factual knowledge — that consensus, articulated by analysts at Kore.ai and by Sebastian Raschka among others, has solidified.
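The retrieval step from the RAG lane above can be sketched without any external library. The three-dimensional vectors below are hand-written stand-ins for real embeddings, the in-memory list stands in for a vector database, and all function names and the prompt wording are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def retrieve_top_k(query_vec, store, k=2):
    """Return the texts of the k fragments most similar to the query vector."""
    ranked = sorted(store, key=lambda item: cosine_similarity(query_vec, item[1]),
                    reverse=True)
    return [text for text, _ in ranked[:k]]


def build_prompt(question, fragments):
    """Inject the retrieved fragments into the prompt sent to the model."""
    context = "\n".join(f"- {fragment}" for fragment in fragments)
    return (f"Answer using only the context below.\n\n"
            f"Context:\n{context}\n\nQuestion: {question}")


# Toy "vector database": (text, embedding) pairs with made-up embeddings.
store = [
    ("Refunds are processed within 14 days.", [0.9, 0.1, 0.0]),
    ("The office is closed on public holidays.", [0.0, 0.2, 0.9]),
    ("Refund requests need an order number.", [0.8, 0.3, 0.1]),
]

query_vec = [1.0, 0.2, 0.0]  # made-up embedding of the user's question
top_fragments = retrieve_top_k(query_vec, store, k=2)
print(build_prompt("How long does a refund take?", top_fragments))
```

The model's weights never change here; the facts arrive through the prompt, which is what keeps them fresh and traceable to a source document.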
In practice teams build compound systems: a small LoRA adapter captures domain behaviour, RAG supplies the facts, prompt engineering orchestrates the flow.
How does it differ from other approaches?
| Dimension | Pre-training | Fine-tuning |
|---|---|---|
| Goal | foundational knowledge of language and the world | adjust behaviour, format, style |
| Data | trillions of raw tokens | thousands–hundreds of thousands of instruction/response pairs |
| Mechanism | self-supervised (next-token, MLM) | supervised + RL (RLHF, DPO, GRPO) |
| Cost | tens to hundreds of millions of USD | hundreds to thousands of USD (LoRA) |
| Hardware | thousands of H100 GPUs for weeks to months | single GPUs for hours |
| Who does it | ~10 global labs | community, startups, enterprises |
The philosophical difference: pre-training creates capabilities, fine-tuning steers capabilities. If a capability is not in the base model, fine-tuning will not invent it.
Key limitations and challenges
- Catastrophic forgetting — domain fine-tuning can overwrite previous knowledge. The more aggressive the run, the higher the risk that the model loses skills it had moments earlier.
- Perplexity curse: in continued pre-training, low perplexity on new documents (perplexity measures how surprised the model is by each next token, so a low value means it predicts the text well) does not guarantee that the model can actually use that knowledge. The model learns to recite the text without integrating it. Hence the push toward converting documents into Q&A pairs before training.
- Reward hacking — under RLHF the model can learn to game the reward model rather than genuinely improve. This is one of the main reasons the field has moved toward DPO and rule-based rewards.
- Knowledge cutoff — knowledge frozen in weights ages. Updating it requires retraining or layering RAG on top; fine-tuning alone is poor at injecting new facts reliably.
- Data bottleneck — frontier models are running out of high-quality public text. Hence the rising importance of synthetic data and knowledge distillation from larger models into smaller ones.
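Perplexity, mentioned in the list above, has a compact definition: the exponential of the average negative log-probability the model assigned to the tokens that actually occurred. The sketch below computes it from a list of such probabilities; the function name and the numbers are illustrative.

```python
import math


def perplexity(true_token_probs):
    """exp of the mean negative log-probability given to the true tokens."""
    mean_nll = -sum(math.log(p) for p in true_token_probs) / len(true_token_probs)
    return math.exp(mean_nll)


# A model that gives each true token probability 0.25 is as unsure
# as if it were choosing uniformly among 4 options.
print(round(perplexity([0.25, 0.25, 0.25, 0.25]), 4))

# More confident predictions give lower perplexity.
print(round(perplexity([0.9, 0.8, 0.95]), 4))
```

The metric only measures how well the text is predicted, which is why a low value on new documents says little about whether the model can answer questions about them.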
Why does it matter?
In 2025 pre-training and fine-tuning stopped being two points on a timeline and became names of full engineering disciplines. A striking example is the DeepSeek-R1 model, which showed that you can skip much of the SFT stage and apply massive Reasoning RL with verifiable rewards (RLVR), using GRPO (Group Relative Policy Optimization) instead of PPO. During that RL training, chain-of-thought and self-correction emerged largely by trial and error. Even more striking, its reasoning could be distilled into smaller open models (Llama, Qwen 32B) that rival much larger closed models on reasoning benchmarks at a fraction of the cost.
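The "group relative" part of GRPO can be sketched briefly. For one prompt the model samples several answers, each is scored by a verifiable rule, and every answer's advantage is its reward measured against the group's mean and spread. The exact-match reward, the use of the population standard deviation, and the small `eps` term are assumptions made for this example.

```python
import statistics


def exact_match_reward(answer, expected):
    """Verifiable reward: 1.0 if the final answer matches the known result, else 0.0."""
    return 1.0 if answer.strip() == expected.strip() else 0.0


def group_relative_advantages(rewards, eps=1e-8):
    """Normalise each reward against the mean and standard deviation of its group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# Four sampled answers to the same arithmetic prompt; the expected result is "42".
sampled_answers = ["42", "41", "42 ", "7"]
rewards = [exact_match_reward(a, "42") for a in sampled_answers]
print(rewards)
print([round(a, 3) for a in group_relative_advantages(rewards)])
```

Answers that beat their siblings get a positive advantage and are reinforced, so no separate learned judge is required for tasks whose results can be checked automatically.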
For product teams the takeaway is simple: do not fine-tune to dodge the knowledge cutoff. Fine-tune to enforce structure, tone, and behaviour. Each need has its own lever:
- Facts → RAG.
- Domain knowledge → mid-training + SFT.
- Reasoning → RL.
Each lever operates at a different level and costs a different amount. Teams that know which lever to pull, and when, ship cheaper and better products. Teams that treat fine-tuning as the answer to everything spend millions on something a RAG pipeline solves in an afternoon.
This specialisation keeps deepening. Mid-training is becoming a standard pipeline stage. Reasoning RL is reaching more and more frontier models. And LoRA plus distillation increasingly let small companies match cloud services locally — without writing a single line of CUDA.
Sources
- Sebastian Raschka — technical deep-dives on LoRA, DPO, and LLM training pipelines — sebastianraschka.com
- APX ML — Llama 3 405B training cost estimates — apxml.com
- Hugging Face — discussions and posts on DeepSeek-R1, GRPO and Reasoning RL — huggingface.co
- Toloka AI — explanations of SFT, RLHF and instruction tuning — toloka.ai
- Kore.ai — fine-tuning vs RAG vs prompt engineering analysis — kore.ai
- Interconnects (Nathan Lambert) — analysis of Reasoning RL and new alignment algorithms — interconnects.ai
