TL;DR (aka: Why this playbook matters)
Training a modern large language model (LLM) is not just “pick an architecture, grab a dataset, press go.” The polished research papers make it look like that — clean loss curves, elegant ablations, perfect hindsight. Reality? It’s chaotic, political, economic, and infrastructural. It’s 2 a.m. debugging a mysterious loss spike. It’s discovering that a tensor parallelism setting quietly crippled throughput. It’s restarting training after one trillion tokens because something subtle went sideways.
It’s also strategy: most teams should not train a model from scratch at all, and those who should need to be brutally clear about why, what, and how before lighting up 384 H100s.
This summary walks chapter by chapter through The Smol Training Playbook: The Secrets to Building World-Class LLMs, which documents how Hugging Face built SmolLM3 — a 3B parameter multilingual reasoning model trained on 11T tokens — and, more importantly, what they learned in the process. The book is part memoir, part blueprint, part scar-tissue manual. It covers: deciding whether it’s even rational to train your own model, standing up a reliable ablation pipeline, building your architecture, mixing the right data, instrumenting infrastructure, surviving loss spikes, and post-training.
The original document positions itself as the final piece after earlier releases on the dataset stack (FineWeb), infrastructure at ultra scale, and evaluation methodology. Said another way: this is where all of that comes together — GPUs, data, evaluations, and architectural choices — into an actual trained model you can ship. See also: FineWeb (https://huggingface.co/spaces/HuggingFaceFW/blogpost-fineweb-v1).
We’re going to go chapter by chapter. We’ll keep the tone honest, keep links in the body like a normal blog article, and we’ll stay faithful to the source. No hallucinated features, no invented benchmarks, no fictional tuning tricks. Just what’s actually in the book.
Chapter 1. Introduction: training models is not clean, it’s war
The book opens by smashing the myth that LLM training is a neat pipeline. Papers tell a story of “clarity”: pick a clever architecture, curate a perfect dataset, run it on enough compute, profit. But that’s survivor bias. What you don’t read in papers: the 2 a.m. dataloader that silently shuffled tokens wrong; the instability spike at step 80K; the “subtle tensor parallelism bug (see later!) that quietly sabotages your training.” The reality is messy, iterative, and political. It’s full of choices that never get written down because, honestly, they’re kind of embarrassing.
Then it asks the fundamental question most people skip: what does it actually take to train a high-performance LLM today? Not in theory, not in marketing decks, but in production. The answer the authors propose is: a brutally honest look at the lifecycle of building SmolLM3, a 3B model that’s multilingual, math-capable, code-aware, and trained to run well on edge-ish devices (think phones). The model was trained on 11 trillion tokens — not million, not billion — and is meant to be small-but-smart, not just a shrunken giant, see: https://huggingface.co/HuggingFaceTB/SmolLM3-3B.
This “Introduction” frames the entire playbook as an operations log rather than a victory lap. It promises to share not just final settings, but the false starts, the painful restarts, and the late-game recovery work after unexpected training instability. And importantly: it says outright that many promising small-scale ablations did not extrapolate the way you’d hope, and that at one point they literally restarted training after 1T tokens. That is a terrifying statement if you’ve ever paid for H100 time, and it sets the tone: this isn’t heroic mythmaking. It’s a warning label for future you.
The chapter also orients you to the supporting ecosystem Hugging Face had already built leading up to this project:
- FineWeb / FineWeb-Edu, a massive high-quality pretraining dataset pipeline. https://huggingface.co/spaces/HuggingFaceFW/blogpost-fineweb-v1
- Ultra Scale Playbook, which is basically infrastructure notes for making thousands of GPUs behave. https://huggingface.co/spaces/nanotron/ultrascale-playbook
- Evaluation Guidebook, a practical blueprint for measuring model ability, not just lowering cross-entropy. https://github.com/huggingface/evaluation-guidebook
By the time you’re reading this playbook, all the raw materials exist: data, infra, evals. Now comes orchestration.
Finally, the introduction lays out the structure of the full work. You don’t actually have to read it front-to-back like a novel. The book is made of four big arcs:
- Training Compass — strategy. Should you even train? What, specifically, should you train? And how?
- Pretraining — the recipe. Ablations, data mix, architecture, hyperparameters, and surviving the long haul.
- Post-training — SFT, DPO, GRPO, merging. Turning a raw base model into something deployable and helpful.
- Infrastructure — the “industrial-grade oven.” Clusters, bottlenecks, communication patterns, debugging loss spikes in the middle of a 384-GPU job.
You can jump to the arc you need right now (maybe you only care about post-training tricks, or maybe you are literally on fire in infra and need to un-block your cluster).
The vibe of Chapter 1 is: romantic myths are cute; operational truth saves six months and several million dollars.

Chapter 2. Training Compass: why → what → how
Before you spin up a trillion-token run, the book asks a question the field often dodges out of ego: Should you even train this model at all?
The “Training Compass” chapter is a strategic diagnostic. The authors argue that machine learning culture is obsessed with optimization knobs — loss, throughput, new attention variants — but ignores a deeper gatekeeping question: “Do you even need to train?” Because the open ecosystem is burly now. We’re in a world where labs are dropping world-class open models almost daily — Qwen, Gemma, DeepSeek, Kimi, Llama, Olmo, etc.
The text stresses that these are not just research toys anymore; they’re production-grade, multilingual, reasoning-capable, code-aware, heavily fine-tuned agents with permissive licenses and active communities. You can often just… use them. Or continue pretraining them. Or post-train them. Not reinvent them.
The authors say a hard thing out loud: a lot of failed LLM efforts didn’t fail because of bad learning rates. They failed because someone with access to 100 H100s for three months said “let’s train a model!” — with no strategy, no product alignment, and zero clarity on who will use the model and why. Six months later, they had a model nobody wanted, and a burnt-out team. That happens constantly.
So the compass starts with two gating questions:
- Why are you training?
- What, specifically, are you trying to build?
If you can’t answer those coherently, you’re about to waste compute, credibility, and sleep.
2.1 The “Why” (Research, Production, Strategic Open-Source)
The book lays out three legitimate categories of “why,” and if you’re not in one of them, you probably shouldn’t be spinning up a full pretraining run:
(A) Research.
This is when you’re probing a real scientific / architectural / scaling / optimization question. Examples from the book:
- Can a new optimizer (e.g. Muon) scale to 10B+ models?
- Can reinforcement learning alone (without supervised fine-tuning) create reasoning skills? (Referencing DeepSeek-R1 style work.)
- Can small models trained only on synthetic “textbook” data reach competitive capability? (“Textbooks Are All You Need.”)
- Can we get competitive performance using purely openly licensed data, e.g. The Common Pile v0.1 (8TB of open/public domain text)?
Research pretraining is valid if you’re making knowledge, not just a clone.
Links referenced:
- https://huggingface.co/papers/2502.16982
- https://huggingface.co/papers/2501.12948
- https://huggingface.co/papers/2306.11644
- https://huggingface.co/papers/2506.05209
(B) Production.
This is for a company with a very specific use case that off-the-shelf models just can’t handle. The book gives three main reasons this happens in production:
- Domain specificity. For example, if you’re doing genomics or high-regulation finance, you need deep mastery of niche vocabulary and structure.
- Deployment constraints. You might need ultra-low-latency, on-device inference, air-gapped/on-prem deployment, or operation in weird environments (e.g. drones, FPGAs). You care about power draw and memory layout.
- Governance / safety / regulatory control. You need full provenance over the data, you need to satisfy regulators, and you must be able to prove and audit behaviors. Sometimes you are legally required to own model weights + data chain-of-custody.
Important nuance here: the authors strongly recommend stress-testing the “we need our own model” instinct before committing. Before training from scratch, try building on something like Qwen3, Gemma3, etc. (which are open, high-performing, and constantly iterated), and see if you can get there with prompting, tool-use, or fine-tuning. If you can, then don’t train from scratch. If you can’t, then yes — you have a production-justified reason.
Some of those model families are here:
- https://huggingface.co/collections/Qwen/qwen3-67dd247413f0e2e4f653967f
- https://huggingface.co/collections/google/gemma-3-release-67c6c6f89c4f76621268bb6d

(C) Strategic Open-Source.
This is about filling a gap in the open ecosystem. Maybe there’s no strong on-device model with truly long context. Maybe multilingual models exist but under-serve low-resource languages. Maybe the world is racing toward world-model agents (the book nods at Genie 3, from DeepMind, described as an “interactive world-model”), and there’s no open-weight equivalent for the community yet.
If you can fill a gap that the ecosystem actually cares about, you are not just “making another LLM,” you are creating infrastructure others will build on. That’s real value. https://deepmind.google/discover/blog/genie-3-a-new-frontier-for-world-models/
In other words: “We have GPUs and vibes” is not a strategy. “We believe we can be the best multilingual 3B on-device reasoning model with 1M context length and we have the data and recipe to prove it” — that’s a strategy.
2.2 Hugging Face’s own “why”
The authors walk through Hugging Face’s internal story to prove this isn’t theoretical. They trace a progression:
- BLOOM via BigScience: an open alternative to GPT-3 at a time when GPT-3 was closed. BigScience wasn’t just a model, it was a global workshop to build the entire stack — tokenizer, corpus, infra — required to train a 175B model openly. https://huggingface.co/bigscience/bloom, https://bigscience.huggingface.co/
- StarCoder / StarCoder2: when OpenAI’s Codex (and GitHub Copilot) showed the power of code models, Hugging Face + ServiceNow launched BigCode, built The Stack dataset (https://huggingface.co/datasets/bigcode/the-stack), and trained StarCoder as an open alternative. StarCoder2 scaled that into a family (3B/7B/15B) and trained longer, not just bigger. https://huggingface.co/bigcode/starcoder, https://huggingface.co/collections/bigcode/starcoder2-65de6da6e87db3383572be1a
- SmolLM → SmolLM2 → SmolLM3. SmolLM started by targeting the “small models are underrated” segment. SmolLM2 pushed “train longer on better data,” and SmolLM3 scaled to ~3B parameters, adding hybrid reasoning, multilinguality, and long context — all in a size they considered viable for phones. https://huggingface.co/collections/HuggingFaceTB/smollm-6695016cad7167254ce15966, https://huggingface.co/collections/HuggingFaceTB/smollm2-6723884218bcda64b34d7db9, https://huggingface.co/HuggingFaceTB/SmolLM3-3B
They also call out other Hugging Face directions like Zephyr (RLHF / DPO work), Open-R1 (reproducing DeepSeek-R1 style reasoning distillation), OlympicCoder (competitive programming), SmolVLM for vision, and SmolVLA for robotics / embodied control. See:
- Zephyr: https://huggingface.co/HuggingFaceH4/zephyr-7b-alpha, https://arxiv.org/abs/2310.16944
- Open-R1: https://github.com/huggingface/open-r1
- OlympicCoder: https://huggingface.co/open-r1/OlympicCoder-7B
- SmolVLM: https://huggingface.co/collections/HuggingFaceTB/smolvlm-6740bd584b2dcbf51ecb1f39
- SmolVLA: https://huggingface.co/lerobot/smolvla_base
These are not vanity projects. They map to gaps. Each one exists because “somebody needs X and no strong open version exists yet.”
In other words, Hugging Face practices what it preaches: don’t train to flex. Train to fill an unmet need in the ecosystem, or to answer a sharp research question, or because governance forces you to.

2.3 The “What”: translating goals to specs
Once you have a real “why”, the next lever is “what you’re actually building.” The book defines “what” in concrete terms:
- Model type (dense transformer? MoE? hybrid?)
- Parameter count / target size
- Tokenizer / vocabulary size
- Context length
- Data mix (English-heavy? multilingual? math? code? long-context documents?)
- Intended assistant behavior (is this a reasoning agent? a code assistant? a multilingual chat model for mobile?)
This matters because constraints ripple. Want an on-device assistant? You can’t just clone DeepSeek V3 with 671B parameters. You need a dense or hybrid model that actually fits in mobile memory. Want extreme context length for retrieval-augmented workflows? Then you care about attention variants and positional encodings that don’t collapse at 128K+ tokens. Want multilingual performance in low-resource languages? You probably need a tokenizer with a bigger vocabulary that’s fair to those languages, plus careful data sampling.
The authors break this planning phase into two stages:
- Planning phase. Start from the “why,” map it to concrete architectural, data, and deployment constraints. E.g. “our device budget demands ~3B dense with efficient memory,” or “we promise regulatory auditability so we need fully licensed data.”
- Validation phase. Once you’ve got your draft spec, you don’t trust your intuition. You test. Systematically. With ablations. You focus on changes that can materially help, not everything that looks shiny on Twitter.
That last piece leads directly to the next chapter: ablations.
2.4 Team design: speed, data obsession, and headcount reality
The Training Compass wraps by talking about teams, and this part is honestly spicy. The authors argue that winning teams in LLM training share two traits:
- Faster iteration loops. The teams that train every quarter will outrun the teams that ship one model a year. Training LLMs is “learning by training.” Skill compounds with reps. They explicitly name groups like Qwen and DeepSeek as examples of fast, relentless iteration cultures.
- Data fanaticism. Architecture is cool, but data quality dominates. The truly elite teams are the ones who obsess over curation, cleaning, mixing, deduplication, domain balancing, multilingual coverage, etc. Over and over the book insists: data is the leverage.
Also: you don’t need a 50-person team to pretrain a frontier-quality model if your scope is focused. The book claims you can pretrain something on the order of a Llama 3–class model with “a handful of people equipped with enough compute to execute,” possibly 2-3 core engineers for the main pretraining loop. You only start needing more specialists (multimodal, multilingual, post-training specialists, RLHF, eval engineering, etc.) once you branch into lots of downstream tasks.
That’s humbling. And clarifying. It means most startups shouldn’t try to reproduce OpenAI’s entire org chart. They should scope tightly, iterate quickly, and make data the hill they’re willing to die on.
Chapter 3. Every Big Model Starts With a Small Ablation
Now we leave strategy and enter practice. This chapter is brutally pragmatic. You’ve chosen a lane. Now: how do you actually choose architecture, optimizer, learning rate schedule, data mixture, tokenizer, context length… without guessing?
The answer is ablations. Systematic, controlled experiments. You generate evidence before you bet months of compute.
The book makes a key philosophical point: outsiders often imagine that high-stakes LLM design decisions are made through grand pure-theory reasoning. Reality: yes, you need strategic thinking, but intuition alone is not reliable. Things that “feel” obviously good sometimes backfire hard.
One example they give: arXiv. It looks like high-quality STEM text, dense with formal reasoning. Intuitively, you’d expect that to supercharge small models. But when you actually pretrain smaller models heavily on arXiv, performance can drop on broad benchmarks. Why? Because arXiv prose is weird: hyper-formal, narrow-domain, and stylistically unrepresentative of general language. The model overfits to “paper voice” and loses general world reasoning and casual fluency. For tiny models especially, overfeeding them narrow academic style can hurt them. (They cite Shao et al., 2024 for this finding.)

The moral: “best-looking data” ≠ “best-performing model,” especially at small scale. You only learn that by testing.
So, the authors define two properties of good ablation experiments:
- Speed. You need fast turnaround. The faster you can loop, the more variants you can test, and the smarter you become. A team that can run many ablations becomes dangerous; a team that can’t is stuck in guess-land.
- Reliability. Your experimental setup has to produce meaningful, discriminative signal. If your eval suite is noisy, or if your runs are too short to show differences, you’ll chase ghosts.
This “fast but reliable” tension is the heart of practical LLM R&D. You’re constantly trading off experiment cost against how much certainty you need to lock in a decision before main training.
They also say: before ablations, you must lock basics like model type and target size, because those constraints tell you what a “baseline” even is. For SmolLM3, Hugging Face targeted a dense Llama-style 3B model, aiming at on-device deployment and multilingual/mathy/codey reasoning. For your use case, you might need MoE or hybrid. Those decisions affect data loading, optimizer behavior, throughput characteristics, and inference costs, so you need them set early.
This tees us up for the next microchapter: choosing the baseline.
Chapter 4. Choosing Your Baseline
This part sounds almost boring, but it is absolutely not boring. It is where most people go wrong.
The authors say: nearly every successful modern model is not invented from nothing. Somebody starts from a strong baseline — an existing proven architecture + optimizer stack that has already been battle-tested at scale — and then adapts it. Qwen started from Llama-style transformer DNA. Meta used Llama 2 as the precursor to Llama 3. Kimi’s K2 traces lineage to DeepSeek-V3’s MoE design. This is not plagiarism. This is how progress compounds.
Why you should start from a known-good baseline:
- Strong architectures and optimizer recipes take years — literally years — to de-bug, profile, and stabilize.
- Openly available families like Llama, Qwen, Gemma, etc. have already survived multi-trillion-token runs.
- You inherit institutional memory you did not have to pay for.
- You avoid re-discovering catastrophic edge cases mid-flight.
The book even gives you a non-exhaustive menu of “2025 baseline options,” across dense, MoE, and hybrid configurations, and across sizes from sub-billion to hundreds of billions of (active) parameters:
- Dense:
- Llama 3.1 (8B, 70B)
- Llama 3.2 (1B, 3B)
- Qwen3 (0.6B, 1.7B, 4B, 14B, 32B)
- Gemma3 (12B, 27B)
- SmolLM2 / SmolLM3 (135M, 360M, 1.7B, 3B)
- MoE:
- Qwen3 MoE (e.g. “30B-A3B,” “235B-A22B”)
- GPT-OSS (21B-A3B, 117B-A5B)
- Kimi Moonlight 16B-A3B
- Kimi K2 (1T-A32B)
- DeepSeek V3 (671B total, ~37B active)
- Hybrid:
- Zamba2 (1.2B, 2.7B, 7B)
- Falcon-H1 (0.5B → 34B)
- Qwen3-Next (80B-A3B, MoE + Hybrid)
- MiniMax-01 (456B-A46B, MoE + Hybrid)
Links in that table include:
- https://huggingface.co/collections/meta-llama/llama-31-669fc079a0c406a149a5738f
- https://huggingface.co/collections/Qwen/qwen3-67dd247413f0e2e4f653967f
- https://huggingface.co/collections/openai/gpt-oss-68911959590a1634ba11c7a4
- https://huggingface.co/moonshotai/Moonlight-16B-A3B-Instruct
- https://huggingface.co/deepseek-ai/DeepSeek-V3
- https://huggingface.co/Zyphra/models?search=zamba2
- https://huggingface.co/collections/tiiuae/falcon-h1-6819f2795bc406da60fab8df
- https://huggingface.co/Qwen/Qwen3-Next-80B-A3B-Instruct
- https://huggingface.co/MiniMaxAI/MiniMax-Text-01
The explicit advice:
- Pick something close to your target parameter count.
- Don’t overthink it at first.
- You’re allowed to evolve it.
- You’ll derisk any deviation through ablations anyway.
This is the opposite of ego-driven “we’re building everything from scratch because we’re geniuses.” The book’s position is clear: ego kills timelines.
Chapter 5. Modifying Your Baseline: The Discipline of Derisking
Okay, you’ve picked your starting point. Now you want to tweak it. You want longer context, different attention, tied vs. untied embeddings, new positional encodings, more KV heads, NoPE vs RoPE, etc. How do you not blow yourself up?
This chapter gives you one law:
Never ship a change you haven’t derisked.
Read that again. Then tape it to your monitor.
A change is considered “derisked” if:
- You’ve tested it in ablation and it actually improves your target capabilities, OR
- It gives you tangible operational benefit (e.g. faster inference, lower memory, better stability) without harming performance beyond what you consider acceptable.
The tricky part: the combinatorics. Your model has dozens of knobs you could turn — attention type; positional encodings; activation functions; optimizer choice; learning rate schedule; normalization schemes; embedding tying; tokenizer vocabulary size; context window; data mix; etc. You cannot grid-search that full space. You will go broke and lose all signal. Instead you do strategic experimentation: before testing any modification, ask two questions:
- Will this help my specific use case?
- Will this optimize my training (speed, stability, cost)?
If the answer to both is “meh,” skip it. Don’t chase hype.
The chapter also introduces a disciplined loop:
- Test one promising modification against your current baseline.
- If it works, fold it into the baseline and that becomes the new baseline.
- Then test the next thing.
This prevents you from combining six unproven tricks and having no idea which one helped or which one silently poisoned stability.
The tone here is almost parental: “Please don’t be reckless. Please. You will thank us when you’re 9T tokens in and not trying to explain to leadership why the model forgot how to divide.”
Chapter 6. Picking a Training Framework
Framework choice sounds like tooling, but in practice it defines your daily pain.
The book lists hard requirements for the training framework you use for ablations and for the main run:
- It must actually support your desired architecture (dense, MoE, hybrid, etc.), or be easily extensible.
- It must be stable and production-ready enough not to silently explode mid-run.
- It must deliver strong throughput on your hardware, because iteration speed is life.
They compare four main options the team considered (or built):
- Megatron-LM (NVIDIA): battle-tested, powers systems like Kimi K2 and NVIDIA-aligned giant models. Pros: throughput, maturity, 3D parallelism expertise. Cons: heavyweight, harder to modify if you’re new. https://github.com/NVIDIA/Megatron-LM
- DeepSpeed: pioneered ZeRO and large-scale parallelism, and has powered BLOOM and GLM. Strong, but large and complex codebase (the book cites ~194k lines of code), which can intimidate newer teams trying to hack in custom behavior or debug weirdness. https://github.com/deepspeedai/DeepSpeed
- TorchTitan: a newer PyTorch initiative. Much lighter, modular, easier to navigate. Great for rapid experimentation and dense models, but less battle-tested than Megatron or DeepSpeed and still stabilizing. https://github.com/pytorch/torchtitan
- Nanotron: Hugging Face’s own in-house framework, later open-sourced. Built originally to give them full flexibility and deep visibility into pretraining internals, and the expertise accumulated there eventually became the “Ultra Scale Playbook.” Nanotron now supports all production features they need for training dense models, though MoE support is still being built out. (Ultra Scale Playbook: https://huggingface.co/spaces/nanotron/ultrascale-playbook).
Key insight: rolling your own stack (like Nanotron) made sense for Hugging Face because they were accumulating institutional expertise and wanted to codify it. But it is not necessarily the “move” for everyone. You pay in engineering hours, debugging, missing features, and on-call stress. A lighter option is to fork an existing framework like TorchTitan and adapt it. The authors even mention a lab (Thinking Machines Lab) that built their internal pretraining library as a TorchTitan fork.
Bottom line: choose based on your team’s maturity and your risk tolerance. If multiple frameworks can do what you need, benchmark throughput on your hardware and bias toward the thing you can iterate with fastest. Speed of iteration is non-negotiable.
Chapter 7. Ablation Setup (How to actually run the experiments)
After picking a framework, you design the ablation environment itself — the miniature lab where you test architectural decisions under controlled conditions.
The goal: run small experiments that give transferable signal about your eventual giant run. You want to know, at low cost, which ideas are good. Then you only scale the proven ones.
The book says there are two main approaches:
Approach A: Train the real target model size, but on fewer tokens.
For SmolLM3 ablations, they trained the full 3B-parameter configuration — the real eventual shape of the model — but only on 100B tokens (instead of the final 11T). That lets you evaluate architectural decisions without paying full price.
Approach B: Train a smaller proxy model.
If the target is huge (like a trillion-parameter MoE with tens of billions of “active” parameters per token), you spin up a mini version — same style, same qualitative characteristics, but drastically cheaper — and test there. The authors note that Kimi’s K2 project (1T total params, ~32B active) used a ~3B MoE with ~0.5B active parameters to do exploratory ablations. That made experimentation feasible.
Do small-scale results transfer? The authors say:
- If something hurts performance at small scale, it’s almost always safe to discard it at large scale.
- If something helps at small scale, you still need to be cautious. You should train for a sufficiently long number of tokens to convince yourself it generalizes upward. The closer your ablation setup is to your final run (in depth, context length, data mix), the more trustworthy your signal.
For their ablations in this playbook, they use a baseline “vanilla transformer” in two flavors:
- A 1B parameter transformer (Llama 3.2 1B–style config) trained on ~45B tokens, which can be trained in about 1.5 days on a single node with 8×H100s, using their Nanotron config. They report ~42K tokens/sec/GPU throughput.
- A 3B parameter transformer trained on ~100B tokens, which is closer to the final SmolLM3 shape.
Links:
- Llama-3.2-1B reference: https://huggingface.co/meta-llama/Llama-3.2-1B
- Nanotron configs: https://huggingface.co/datasets/HuggingFaceTB/training-guide-nanotron-configs
- Baseline config: https://huggingface.co/datasets/HuggingFaceTB/training-guide-nanotron-configs/blob/main/baseline_config_1B.yaml
This is an important operational data point: ~1.5 days, 8×H100s, for a 1B/45B-token run that gives you architectural signal. That is a budgetable iteration loop for a small, focused team. You don’t need a 512-GPU superpod just to learn whether tied embeddings are okay.
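A quick back-of-the-envelope check of that claim (my arithmetic, not the book's; the figures are the ones quoted above):

```python
# Rough wall-clock estimate for the 1B ablation run from the reported throughput.
tokens = 45e9                   # ~45B training tokens
tok_per_sec_per_gpu = 42_000    # reported ~42K tokens/sec/GPU
gpus = 8                        # one H100 node

seconds = tokens / (tok_per_sec_per_gpu * gpus)
print(f"{seconds / 86_400:.2f} days")   # ~1.55 days -> matches the "about 1.5 days" figure
```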
7.1 Anatomy of the baseline config
The playbook actually peels open the baseline YAML config and explains its structure. High-level components:
- Datasets and mixing weights.
They’re mixing:
- FineWeb-Edu: curated high-quality English web data.
- Stack-Edu-Python: educational code / Python data.
- FineMath / FineMath-3plus: math reasoning data.
Each dataset has an explicit weight (e.g. 0.71, 0.21, 0.11 in the example snippet) to shape the model’s early skill profile (language, code, math) while staying simple. https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu, https://huggingface.co/datasets/HuggingFaceTB/stack-edu, https://huggingface.co/datasets/HuggingFaceTB/finemath
- Model architecture.
The example config is basically a Llama 3.2–style 1B model:
- hidden_size: 2048
- num_hidden_layers: 16
- num_attention_heads: 32
- num_key_value_heads: 8 (i.e. grouped-query-ish setup rather than full MHA everywhere)
- intermediate_size: 8192
- max_position_embeddings: 4096
- rope_theta: 50000.0
- tie_word_embeddings: true
So: a dense decoder-only transformer with RoPE-style positional encoding, mid-size head counts, and ~4K context length.
- Training hyperparameters.
They’re using AdamW with gradient clipping (clip_grad: 1.0), betas ~0.9 / 0.95, eps 1e-08. Learning rate ~5e-4 with linear warmup to step 2,000 and cosine decay afterward. This is very standard modern LLM practice: warm up, coast, cosine down.
- Parallelism.
For the 1B setup, it’s just data parallelism across 8 GPUs (dp: 8), with no tensor or pipeline parallelism (tp: 1, pp: 1). At 1B scale you don’t need fancy 3D parallelism yet.
- Tokenizer.
tokenizer_name_or_path points at HuggingFaceTB/SmolLM3-3B, tokenizer_max_length 4096, etc. I.e. they lock in tokenizer behavior so results are consistent. https://huggingface.co/HuggingFaceTB/SmolLM3-3B
- Batching / tokens.
Global batch size is computed as (data parallel degree) × (gradient accumulation) × (micro batch size) × (sequence length). They’re targeting ~1.5M tokens per optimization step in this example, run for ~20K steps, totaling ~30B tokens for the ablation setting. That is enough to get meaningful signal without spending weeks.
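To make that batching and schedule arithmetic concrete, here is a small sketch; the micro-batch and gradient-accumulation values are illustrative placeholders (any pair that lands near ~1.5M tokens/step works), not the book's actual config:

```python
import math

# Global batch size in tokens = dp x grad_accum x micro_batch x seq_len.
dp, grad_accum, micro_batch, seq_len = 8, 12, 4, 4096   # micro_batch / grad_accum are illustrative
tokens_per_step = dp * grad_accum * micro_batch * seq_len
print(f"{tokens_per_step:,} tokens/step")                              # 1,572,864 -> ~1.5M
print(f"{tokens_per_step * 20_000 / 1e9:.1f}B tokens over 20K steps")  # ~31.5B, i.e. the ~30B ablation budget

# The "warm up, coast, cosine down" schedule described above, as a function of step.
def lr_at(step, peak=5e-4, warmup=2_000, total=20_000, min_lr=0.0):
    if step < warmup:
        return peak * step / warmup                     # linear warmup
    progress = (step - warmup) / (total - warmup)
    return min_lr + 0.5 * (peak - min_lr) * (1 + math.cos(math.pi * progress))  # cosine decay

print(lr_at(1_000), lr_at(2_000), lr_at(20_000))        # 2.5e-4, 5e-4, 0.0
```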
The book then hammers one operational rule:
Modify one thing at a time.
You must isolate variables. If you change attention type and tokenizer and LR schedule and see a +2% bump in HellaSwag, you’ve learned nothing, because you don’t know which change did it. Make a change, test it, lock it in if it works, then move on.
They also talk about tracking parameter counts when you change architectural pieces. For example:
- Untying embeddings (separate input/output embeddings) can dramatically change parameter counts.
- Switching MHA → GQA or MQA can reduce attention parameter cost.
When you compare models in ablation, you need to make sure you’re comparing fairly sized systems or at least reasoning about the parameter delta honestly. They even include a short Python helper using LlamaConfig/LlamaForCausalLM from transformers to estimate total parameter counts given heads, layers, hidden size, etc. https://arxiv.org/abs/2410.06511, https://arxiv.org/abs/2507.20534
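Here is a minimal reconstruction of that kind of helper (my sketch, not the book's exact snippet; the vocab size is illustrative and should match your actual tokenizer):

```python
import torch
from transformers import LlamaConfig, LlamaForCausalLM

def count_params(**overrides) -> int:
    config = LlamaConfig(
        vocab_size=128_256,        # illustrative; plug in your tokenizer's real vocab size
        hidden_size=2048,
        num_hidden_layers=16,
        num_attention_heads=32,
        num_key_value_heads=8,     # GQA: 8 KV heads shared across 32 query heads
        intermediate_size=8192,
        max_position_embeddings=4096,
        tie_word_embeddings=True,
        **overrides,
    )
    # Build on the meta device so no real weights get allocated; if your torch/transformers
    # version complains about meta-device init, drop the context manager (it then just costs RAM).
    with torch.device("meta"):
        model = LlamaForCausalLM(config)
    return sum(p.numel() for p in model.parameters())

tied = count_params()
untied = count_params(tie_word_embeddings=False)
print(f"tied:   {tied / 1e9:.2f}B parameters")
print(f"untied: {untied / 1e9:.2f}B parameters")   # untying adds a full vocab x hidden output matrix
```

Running it with and without a change (here, tied vs. untied embeddings) makes the parameter delta explicit before you compare two ablation runs.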
This is the kind of boring detail that prevents you from fooling yourself with unfair apples-to-oranges comparisons.
Chapter 8. Evaluation: How Do You Know What “Worked”?
You ran ablations. Now you have runs. Now what? You need to score them. This sounds simple — “just look at the loss curve” — but the book argues that naïvely looking only at loss is dangerous.
Why loss alone is not enough:
- Some datasets (like Wikipedia) produce artificially low next-token prediction loss because they’re homogenized and predictable. That doesn’t mean your model is better for downstream reasoning.
- Tokenizer changes break comparability: different token splits lead to different reported losses.
- Some capabilities (reasoning, math) barely show up in average loss until quite late.
- Models can keep getting better on downstream tasks after perplexity plateaus. (They nod to Liu et al., 2022 to make this point.)
So the authors propose a more robust evaluation strategy using downstream benchmarks that test reasoning, world knowledge, math, code, and long-context ability. They emphasize four principles for choosing “good ablation benchmarks” early in training:
- Monotonicity. Scores should trend up as training proceeds. If a metric randomly wanders, it’s useless for decision-making.
- Low noise across seeds. If two seeds of the same setup score wildly differently, that benchmark is not stable enough for early guidance.
- Above-random performance early enough. If performance sits at random for 80% of training, you get zero early signal and can’t use it to discriminate architectural choices.
- Ranking consistency. If setup A outperforms setup B at step X, that ordering should mostly persist later. You want consistent relative ordering.
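As a toy illustration (my sketch with hypothetical helpers and data, not the book's tooling), here is how you might screen a candidate benchmark against those properties using per-checkpoint scores from a couple of ablation runs:

```python
import numpy as np

def monotonic_fraction(scores):
    """Fraction of checkpoint-to-checkpoint transitions where the score improves."""
    return float(np.mean(np.diff(scores) > 0))

def seed_spread(final_scores_per_seed):
    """Std-dev of the final score across seeds; high spread = too noisy to steer decisions."""
    return float(np.std(final_scores_per_seed))

def ranking_consistency(curve_a, curve_b):
    """How often two setups keep the same relative ordering as at the final checkpoint."""
    final_order = curve_a[-1] > curve_b[-1]
    return float(np.mean([(a > b) == final_order for a, b in zip(curve_a, curve_b)]))

# Hypothetical accuracy curves for two ablation variants on one benchmark.
variant_a = [0.28, 0.31, 0.34, 0.37, 0.40]
variant_b = [0.27, 0.29, 0.32, 0.34, 0.36]
print(monotonic_fraction(variant_a))               # 1.0 -> trends up
print(seed_spread([0.40, 0.41, 0.39]))             # small -> stable across seeds
print(ranking_consistency(variant_a, variant_b))   # 1.0 -> the A > B ordering holds throughout
```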
They connect this philosophy to their FineTasks / FineWeb evaluations work. https://huggingface.co/spaces/HuggingFaceFW/blogpost-fine-tasks
They also talk about task formulation. Benchmarks can be:
- MCF (Multiple Choice Format): The model picks A/B/C/D (like classic MMLU).
- CF (Cloze Formulation): You don’t explicitly present A/B/C/D; you compute which answer token sequence is most likely under the model.
- FG (Freeform Generation): The model must generate an answer, like solving a GSM8K math word problem and writing the numeric result, or producing code for HumanEval.
Key operational guidance:
- For pretraining ablations, CF and MCF are gold. Freeform gen (FG) is often too hard and too noisy early on, especially before instruction tuning / post-training.
- For post-trained models, FG becomes primary because you actually care if the model can generate helpful responses. (Think chat assistants, reasoning agents, coding copilots.)
The playbook then shows the evaluation suite they use for ablations. It includes:
- MMLU (broad academic knowledge over 57 subjects)
- ARC (grade-school science reasoning)
- HellaSwag (commonsense continuation)
- WinoGrande (pronoun resolution with commonsense)
- CommonSenseQA (everyday reasoning)
- OpenBookQA (elementary science facts + reasoning)
- PIQA (physical commonsense)
- GSM8K (grade-school math word problems)
- HumanEval (code generation; synthesize Python functions from docstrings)
- RULER for long-context stress testing in some cases
They run ~1,000 questions per benchmark for speed (except for math/code/long-context cases where they sometimes evaluate full sets in the 3B-scale experiments). They score with LightEval. https://github.com/huggingface/lighteval, https://arxiv.org/abs/2403.15796, https://arxiv.org/abs/2406.08446, https://arxiv.org/abs/2406.11794
There’s also a really nerdy but important detail about CF scoring: they normalize log-probability by character length. Why? Without normalization, the model might favor short answers simply because they’re short. Length-normalized likelihood avoids biasing toward shorter candidates. That’s the level of paranoia you need when tiny eval differences are steering million-dollar training decisions.
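A minimal sketch of what character-length-normalized CF scoring looks like in practice (an assumed implementation of the idea, not LightEval's actual code; it also assumes the prompt's tokenization is a prefix of the prompt-plus-answer tokenization):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "HuggingFaceTB/SmolLM2-135M"   # any small causal LM works for the illustration
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id).eval()

def normalized_logprob(prompt: str, answer: str) -> float:
    """Sum of answer-token log-probs given the prompt, divided by the answer's character length."""
    prompt_len = tok(prompt, return_tensors="pt").input_ids.shape[1]
    full_ids = tok(prompt + answer, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits
    logprobs = torch.log_softmax(logits[0, :-1], dim=-1)   # position i predicts token i+1
    targets = full_ids[0, 1:]
    token_lp = logprobs[torch.arange(targets.shape[0]), targets]
    answer_lp = token_lp[prompt_len - 1:].sum().item()     # keep only the answer tokens
    return answer_lp / len(answer)  # without this division, short answers win just for being short

prompt = "Question: What is the capital of France?\nAnswer:"
choices = [" Paris", " Lyon", " the city of Marseille in the south"]
print(max(choices, key=lambda c: normalized_logprob(prompt, c)))
```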
Finally, they stress:
Before you trust any evaluation suite, first confirm you can reproduce published results for known baselines. Do smoke tests. Manually inspect prompts, post-processing, answer extraction. Evals are steering literally everything. Treat them like production code, not weekend scripts.
And they lay down the “rules of engagement” for experimentation:
- Validate your eval suite.
- Change one thing at a time.
- Train long enough to get signal.
- Use enough eval coverage to make decisions with actual confidence.
The emotional throughline here is paranoia. “Be paranoid. Test every change, no matter how small. Don’t underestimate that seemingly harmless library upgrade that ‘only changed two lines.’” Because that two-line change can cascade into subtle regressions that eat you alive 3T tokens later.

Chapter 9. Designing the Model Architecture
Now we hit the “what does the model actually look like?” chapter — the heart of pretraining design.
The authors restate something that sounds obvious but is often skipped in practice: every architectural choice you make should ladder back to the “why” and “what” from the Training Compass. Are you training for on-device multilingual assistants? For 1M-token context retrieval reasoning? For math and code ground truthing? Those goals dictate the trade-offs you’re allowed to accept.
They use SmolLM3 as the worked example. Hugging Face’s goals for SmolLM3 were:
- Strong on-device performance.
- Competitive multilingual ability.
- Solid math and coding capacity.
- Robust long-context handling.
- Total project timeline ~3 months.
Given those constraints, they opted for a dense 3B-parameter model, not a Mixture-of-Experts system. Why? Because MoE/hybrid models complicate inference memory, routing, and deployment footprints — not great for phone targets. A 3B dense model is still small enough to fit comfortably on phones (their words) while big enough to express reasoning and multilingual skills if you train it long and with the right data. https://huggingface.co/HuggingFaceTB/SmolLM3-3B
That matters: they did not just chase SOTA for bragging rights. They tuned architecture for deployment reality. This echoes the “Production” justification from earlier in the book.
Then the chapter zooms out and says: look across modern frontier-ish models — Qwen3, Gemma3, DeepSeek V3, OLMo, Kimi, SmolLM — and you’ll notice something. Under the marketing differences, they’re all still transformer descendants. The transformer core from “Attention is All You Need” has not been replaced; it’s been sharpened. The big changes in 2023–2025 aren’t “throw away the transformer,” they’re refinements to its sub-components to handle scaling, memory limits, long context, multilingual spillover, inference cost, etc. (They mention attention evolution like MHA → GQA / MQA / MLA, positional strategies like RoPE vs NoPE vs partial RoPE, etc.)
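To make one of those refinements concrete, here is a compact sketch of grouped-query attention (GQA), the idea behind setting num_key_value_heads lower than num_attention_heads; it's an illustrative implementation, not SmolLM3's actual attention kernel:

```python
import torch
import torch.nn.functional as F

def gqa_attention(q, k, v, n_heads=32, n_kv_heads=8):
    """q: (batch, seq, n_heads*head_dim); k, v: (batch, seq, n_kv_heads*head_dim)."""
    b, s, _ = q.shape
    head_dim = q.shape[-1] // n_heads
    q = q.view(b, s, n_heads, head_dim).transpose(1, 2)      # (b, H, s, d)
    k = k.view(b, s, n_kv_heads, head_dim).transpose(1, 2)   # (b, H_kv, s, d)
    v = v.view(b, s, n_kv_heads, head_dim).transpose(1, 2)
    group = n_heads // n_kv_heads
    k = k.repeat_interleave(group, dim=1)   # each KV head serves a whole group of query heads
    v = v.repeat_interleave(group, dim=1)
    out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
    return out.transpose(1, 2).reshape(b, s, n_heads * head_dim)

# The payoff: the KV cache only has to store n_kv_heads (here 4x fewer) heads at inference time.
x_q  = torch.randn(1, 16, 32 * 64)
x_kv = torch.randn(1, 16, 8 * 64)
print(gqa_attention(x_q, x_kv, x_kv).shape)   # torch.Size([1, 16, 2048])
```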
In other words — and this is quietly reassuring — you don’t have to invent alien architecture to be competitive. You need to:
- Choose an attention variant that matches your latency/throughput/long-context needs.
- Choose positional encoding / context extension strategy (RoPE? NoPE? partial RoPE? etc.) that won’t collapse at 128K+ tokens.
- Decide tied vs untied embeddings, activation functions, normalization schemes.
- Decide tokenizer approach (reuse a known tokenizer vs. train your own multilingual / code-aware / long-context-friendly tokenizer).
- Decide dense vs MoE vs hybrid.
The book promises (later in this chapter and forward) deep dives into attention mechanisms, positional encodings, tokenizer design, and long-context tricks like NoPE + intra-document masking — the latter two being choices they explicitly call out as having made early in SmolLM3 to maximize long-context stability after struggling with context extension in SmolLM2. In SmolLM2, long context was bolted on later and it was painful. In SmolLM3, they designed for long context from the start using NoPE (No Position Embeddings / a variant of position handling) and intra-document masking. This is a classic example of “learn from previous scars and bake it in day one.”
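Intra-document masking itself is simple to picture: when several documents are packed into one training sequence, each token only attends to earlier tokens from its own document. A toy sketch (my assumption about the mechanics, not SmolLM3's training code):

```python
import torch

def intra_document_causal_mask(doc_ids: torch.Tensor) -> torch.Tensor:
    """doc_ids: (seq_len,) tensor mapping each position to a document index.
    Returns a (seq_len, seq_len) boolean mask, True where attention is allowed."""
    seq_len = doc_ids.shape[0]
    causal = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))
    same_doc = doc_ids.unsqueeze(0) == doc_ids.unsqueeze(1)
    return causal & same_doc

# Three short documents packed into one 8-token sequence.
doc_ids = torch.tensor([0, 0, 0, 1, 1, 2, 2, 2])
print(intra_document_causal_mask(doc_ids).int())
# The first token of doc 1 cannot see doc 0, so unrelated documents never leak into each other.
```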
Also: they point out that the field is trending toward hybrid and MoE layouts, especially for giant-scale systems like DeepSeek V3 (671B params, ~37B active), Qwen3-Next, and MiniMax-01 (456B-A46B). Those models dynamically activate subsets of experts to get effectively huge capacity without fully paying huge-per-token compute. But again, that’s not always what you want if your product constraint is “must run on consumer hardware.” The book is very anti-one-size-fits-all.
This chapter, in effect, tells you this:
Architecture isn’t fashion. Architecture is budget, latency, deployment, multilingual coverage, math ability, and fault tolerance — made concrete. Pick what serves your use case, then derisk each deviation from baseline with ablations before you lock it into the main run.
That’s the design discipline they’re pushing.
Chapter 10. Infrastructure (Preview)
The “How to Read This Blog Post” section (back in Chapter 1) calls infrastructure “the industrial-grade oven.” The metaphor is deliberate: pretraining is dough, post-training is icing, but infra is the oven that can either bake your cake or burn down your kitchen. The chapter we only glimpse from the excerpt promises walkthroughs of:
- GPU layout and 3D parallelism trade-offs.
- Communication patterns between CPU, GPU, nodes, and storage.
- How to identify and remove bottlenecks.
- How to debug catastrophic issues like mysterious throughput collapse or unstable loss spikes while the main run is burning hundreds of GPUs.
It frames infra as both invisible and existential — if your cluster is misconfigured or your comms pattern is bottlenecked, nothing else in the book ships.
It also hints at some of the war stories: “loss spikes,” “handling loss spikes,” “midtraining,” tensor parallelism bugs. The Training Compass outline literally has bullets for “Setup infra,” “Handling loss spikes,” and “Midtraining,” implying that infrastructure work is not just “set up NCCL and go.” It’s an active discipline during training — you may need to stop, stabilize, or even restart after 1T tokens if you discover a subtle scaling pathology. (Yes, they really did that.)
This section matters because infra is where theory meets thermals. Every elegant ablation you ran is useless if your interconnect saturates or your storage pipeline can’t feed 1.5M tokens/step without stalling, or your tensor parallelism is silently corrupting gradients. The playbook treats infra not as back-office work, but as a co-equal pillar of model quality.
Chapter 11. Post-Training (Preview)
Post-training is framed in the Introduction as the “icing and cherry on top,” and it’s its own major part of the book. The authors call this “the post-training alphabet”:
- SFT (supervised fine-tuning)
- DPO (Direct Preference Optimization)
- GRPO (Group Relative Policy Optimization, an RL-style policy optimization method)
- Merging (model merging / knowledge transfer / capability grafting)
The claim is blunt: most of the real-world value users see from LLMs comes after base pretraining, not during it. You can have a gorgeous pretraining loss curve and still end up with a useless assistant unless you post-train it to behave, reason, hold context, and follow instructions.
They warn that most of what actually makes these post-training methods work in production is not in papers — it’s in “painful lessons,” trial-and-error, and dark arts. They promise to unpack those details so you don’t repeat the same pain. And they also preview something subtle and important: merging. The book hints that you don’t always have to train end-to-end to acquire new capabilities; sometimes you can surgically merge specialized models and get composite behavior, instead of re-running a giant RLHF pipeline every time. That’s part of the “alchemy” they tease.
This is huge strategically because it means “train once, specialize many times.” Which is, in practice, how you ship product variations fast.
Chapter 12. Cost, Compute, and Sanity
One of the most sobering tables in the book is the compute accounting for SmolLM3. They show total GPU-hours across:
- Main pretraining run
- Pretraining ablations
- Mid-training ablations
- Debugging + restart after a failure
In raw numbers, they report:
- Main pretraining run: 384 GPUs × 30 days = 276,480 GPU-hours
- Ablations (pretraining): 192 GPUs × 15 days = 69,120 GPU-hours
- Ablations (mid-training): 192 GPUs × 10 days = 46,080 GPU-hours
- Training reset & debugging: 384 GPUs × 3 days + 192 GPUs × 4 days = 46,080 GPU-hours
- Total: 437,760 GPU-hours
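You can sanity-check that accounting in a few lines (plain arithmetic on the figures listed above):

```python
main_run  = 384 * 30 * 24                 # 276,480 GPU-hours
ablations = 192 * 15 * 24                 # 69,120
midtrain  = 192 * 10 * 24                 # 46,080
debugging = 384 * 3 * 24 + 192 * 4 * 24   # 46,080 (restart + investigation)

total = main_run + ablations + midtrain + debugging
print(f"{total:,} GPU-hours")                                  # 437,760
print(f"{(ablations + midtrain + debugging) / main_run:.0%}")  # ~58% of the main run
```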
This is the punchline: ablations + debugging consumed more than half the compute of the “main” run.
Pause on that. Most naive budgets plan for “the run.” The book says the real bill is:
main run + ablations + mid-training ablations + panic-mode debugging.
That has two implications:
- You must budget compute for learning, not just for “final training.” If you don’t, you’ll either skip ablations (and fly blind) or run out of GPUs mid-crisis.
- You must budget for failure. They had to restart after scaling issues. That wasn’t hypothetical. That was real. If you don’t leave buffer for that, you die politically the first time you have to ask leadership for “just two more weeks and 150 more GPUs.”
This is why the earlier chapters obsess over derisking, evaluation hygiene, and paranoia. The cost of sloppiness is not “oops, we lost an afternoon.” It’s “oops, we lost six figures in GPU time and maybe our credibility.”
Closing Reflections
By the time you finish this playbook, one theme is screamingly obvious: LLM training is not just an optimization problem. It’s an organizational discipline.
Here are the cultural laws implied by the book:
- Have a spine about “why.”
Before writing a single line of distributed training code, you must answer “Why are we training this model?” The acceptable answers are (A) research, (B) production constraints you can’t meet otherwise, or (C) a strategically important gap in the open ecosystem you can realistically fill. Anything else — “we’ve got GPUs,” “AI is the future,” “the board wants AI” — is noise and will probably end in tears.
- Translate “why” into a concrete “what.”
Model size, architecture type, tokenizer, data mix, context length, deployment target. Lock those in early, because they define almost everything downstream.
- Derisk every choice.
You don’t “hope” an attention variant is better. You prove it with ablations on a scaled-down but faithful proxy. You don’t trust eval code you haven’t validated. You don’t change two things at once unless you want to spend three days diffing through logs to see which “tiny tweak” nerfed math reasoning.
- Evaluate like a scientist, not like a hypebeast.
Loss is only part of the story. Early-phase benchmarks must be monotonic, low-noise, and discriminative, and they must measure capabilities you actually care about (world knowledge, reasoning, math, code, long context). Use LightEval (https://github.com/huggingface/lighteval) with care. Normalize log-probs. Sanity-check prompts. Manually spot-check outputs. Treat evaluation code as production infra, because it is.
- Respect infrastructure.
Infra is not “IT.” Infra is the oven. It’s where throughput lives or dies, where communication patterns either hum or choke, where you either catch a loss spike at 200B tokens or you pretend you didn’t see it and pay for it 800B tokens later. The book very explicitly positions GPU layout, CPU/GPU/node/storage comms, and debugging of distributed training instabilities as first-class engineering challenges.
- Budget honestly.
The main run is not the cost. The process is the cost. Ablations, restarts, investigation, mid-training experiments, post-training. Plan for ~2× the naive number, minimum — and plan emotionally for the moment you have to tell leadership, “We are restarting from 1T tokens in because we found an interaction bug in tensor parallelism that invalidates the rest of the curve.” That conversation will happen. This playbook is candid about that.
- Iterate fast, obsess over data.
The best teams aren’t necessarily the biggest. They’re the ones that train often, learn ruthlessly from each run, and treat data curation as sacred. Qwen, DeepSeek, and similar teams have become household names not because of one miracle model, but because they ship, learn, ship again, tighten feedback loops, and keep their data pipelines razor sharp.
- Post-training is where users feel it.
Pretraining builds the brain. Post-training teaches the brain to talk to humans, follow instructions, reason step-by-step, handle tools, refuse unsafe requests, generate working code, and act like a usable assistant instead of a math textbook that occasionally screams. This is SFT, DPO, GRPO, merging, guardrails, alignment. This is where you win hearts. https://huggingface.co/HuggingFaceH4/zephyr-7b-alpha, https://github.com/huggingface/open-r1
If you zoom out, The Smol Training Playbook is not “how to copy SmolLM3.” It’s “how to stop lying to yourself.” It’s the missing ops diary of modern LLM training. It tells you:
- Training is a strategy problem first and an engineering problem second.
- Engineering (ablations, infra, evals) is how you defend that strategy against reality.
- Reality is harsh. Plan for that.
And maybe the most subversive message in the whole thing is this:
Most of you shouldn’t train a model from scratch at all.
You should borrow, fine-tune, merge, or surgically extend something that already exists — Qwen3, Gemma3, Llama 3.2, SmolLM2, StarCoder2, etc. — because the open-source scene is now pumping out “world-class models on a nearly daily basis,” and they are production-ready, multilingual, reasoning-aware, code-capable, and even phone-friendly. https://huggingface.co/collections/meta-llama/llama-32-66f448ffc8c32f949b04c8cf, https://huggingface.co/collections/HuggingFaceTB/smollm2-6723884218bcda64b34d7db9, https://huggingface.co/bigcode/starcoder
The playbook simply hands you the mental model and tooling discipline required for the situation where you do have to train from scratch — whether because you’re inventing new science, handling a regulatory fortress, or shipping a capability nobody else is tackling at the size/speed/latency you need.
That’s its gift: brutal clarity, backed by scars.






