Model Distillation: How Small Models Learn From Big Ones
You've probably used a fast, cheap AI model and gotten a surprisingly good response. GPT-4o-mini, Claude Haiku, Gemini Flash: these smaller models perform close to their larger siblings at a fraction of the cost. A big part of the reason is a training technique called distillation, in which a small model learns from the outputs of a larger one.
The core idea is that instead of training the small model (the "student") only on a dataset of correct answers, you train it on the large model's (the "teacher's") outputs: specifically, the probability distributions the teacher computes over every possible next word.
When a model completes the prompt "The weather today is," it computes a probability for every word in its vocabulary before selecting one. "Beautiful" gets 28%. "Warm" gets 19%. "Nice" gets 14%. "Terrible" gets 5%. That ranking reflects which words are interchangeable, which are related, and which don't belong at all.
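To make that concrete, here is a minimal sketch of how you can inspect a model's next-word distribution, assuming the Hugging Face transformers library is available; "gpt2" is used purely as a small, convenient stand-in, so the exact words and probabilities will differ from the numbers above.

```python
# Sketch: inspect the next-token distribution of a causal language model.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

inputs = tokenizer("The weather today is", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits        # shape: (1, seq_len, vocab_size)

# Probabilities over the entire vocabulary for the *next* token.
next_token_probs = torch.softmax(logits[0, -1], dim=-1)

# Show the top few candidates and their probabilities.
top = torch.topk(next_token_probs, k=5)
for prob, idx in zip(top.values, top.indices):
    print(f"{tokenizer.decode(idx.item()):>12s}  {prob.item():.1%}")
```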
If you train a small model on just the final answers, all it learns is which word was picked; the full ranking and the relationships between the alternatives are thrown away. Training the small model to match the big model's probability distributions instead means it inherits that structure. It learns that "warm" and "nice" are reasonable alternatives to "beautiful," and that "car" is not. That structural knowledge is what makes a distilled model better than a model of the same size trained only on the answers.
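In code, matching the teacher's distribution usually means a KL-divergence loss between the teacher's and student's softened probabilities, blended with ordinary cross-entropy on the real next word. Below is a minimal sketch of that objective in PyTorch, following the classic formulation from Hinton et al. (2015); the function names and the temperature/alpha values are illustrative choices, not anything a particular lab has published.

```python
# Sketch of a standard distillation loss in PyTorch.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    # Temperatures > 1 flatten both distributions, so the teacher's
    # ranking of less-likely words ("warm", "nice", ...) carries signal
    # instead of being drowned out by the single top word.
    student_log_probs = F.log_softmax(student_logits / temperature, dim=-1)
    teacher_probs = F.softmax(teacher_logits / temperature, dim=-1)
    # KL divergence pushes the student's distribution toward the teacher's.
    # The T^2 factor keeps gradient magnitudes comparable across temperatures.
    return F.kl_div(student_log_probs, teacher_probs,
                    reduction="batchmean") * temperature ** 2

def student_training_loss(student_logits, teacher_logits, labels,
                          alpha=0.5, temperature=2.0):
    # Blend the soft-target (distillation) loss with ordinary cross-entropy
    # on the ground-truth next words.
    soft = distillation_loss(student_logits, teacher_logits, temperature)
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1.0 - alpha) * hard

# Example with random tensors standing in for real model outputs:
# a batch of 8 positions over a 50,000-word vocabulary.
student_logits = torch.randn(8, 50_000)
teacher_logits = torch.randn(8, 50_000)
labels = torch.randint(0, 50_000, (8,))
print(student_training_loss(student_logits, teacher_logits, labels))
```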
In practice, a small model trained conventionally might score 85% on a benchmark, while the same architecture distilled from a larger teacher reaches 92%, with no increase in size or inference cost.
Distillation is why small models are improving so quickly: they inherit what a larger model spent enormous compute learning from raw data, without needing that compute themselves.