
Distillation

A training technique where a smaller student model learns to mimic the behavior of a larger teacher model.


Technical explanation

Knowledge distillation is how you get a small model to punch above its weight. The idea is simple. You have a large, accurate model (the teacher) that's too expensive to deploy. You train a smaller model (the student) not just on the original data, but also on the teacher's outputs. The student learns to imitate the teacher's behavior, capturing knowledge that would be hard to learn from scratch.
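In practice, the student is usually trained on a weighted mix of two terms: the ordinary cross-entropy against the hard labels, plus a divergence between the student's and teacher's output distributions. A minimal NumPy sketch of that combined loss (the function names, the mixing weight `alpha`, and the temperature `T` are illustrative choices, not fixed by any standard):

```python
import numpy as np

def softmax(logits, T=1.0):
    # Temperature T > 1 softens the distribution.
    z = logits / T
    z = z - z.max()          # for numerical stability
    e = np.exp(z)
    return e / e.sum()

def distillation_loss(student_logits, teacher_logits, hard_label, T=2.0, alpha=0.5):
    """Weighted sum of hard-label cross-entropy and soft-target KL divergence.

    alpha and T here are illustrative hyperparameters; real setups tune both.
    """
    # Hard-label term: standard cross-entropy at temperature 1.
    hard_ce = -np.log(softmax(student_logits)[hard_label])

    # Soft-target term: KL(teacher || student) at temperature T.
    p_teacher = softmax(teacher_logits, T)
    p_student = softmax(student_logits, T)
    kl = np.sum(p_teacher * (np.log(p_teacher) - np.log(p_student)))

    # The T**2 factor keeps the soft term's gradient magnitude comparable
    # to the hard term's as T varies.
    return alpha * hard_ce + (1 - alpha) * (T ** 2) * kl
```

When the student's logits exactly match the teacher's, the KL term vanishes and only the hard-label cross-entropy remains, which is a quick sanity check on any implementation.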

The magic is in what the student learns. When a teacher model outputs probabilities across classes, it reveals relationships between categories. A photo might be 70% cat, 25% dog, 5% other. That soft distribution contains more information than a hard label of just "cat." The student model learns these nuances, which helps it generalize better than training on labels alone.
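The 70% cat / 25% dog / 5% other example can be made concrete. Distillation typically also raises the softmax temperature so the teacher's minority classes carry even more signal; the logits below are an assumed reconstruction chosen so the teacher reproduces exactly that split at temperature 1:

```python
import numpy as np

def softmax(logits, T=1.0):
    z = logits / T
    z = z - z.max()          # for numerical stability
    e = np.exp(z)
    return e / e.sum()

# Assumed logits: log-probabilities of the 70/25/5 split from the text.
teacher_logits = np.log(np.array([0.70, 0.25, 0.05]))

print(softmax(teacher_logits, T=1.0).round(2))  # → [0.7  0.25 0.05]
print(softmax(teacher_logits, T=4.0).round(2))  # → [0.44 0.34 0.23]
```

At T=4 the distribution flattens, so the student gets a much stronger training signal about the cat–dog similarity than a one-hot "cat" label could ever provide.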

Distillation works across model types. You can distill a large language model into a smaller one, an ensemble into a single model, or even a slow but accurate model into a fast approximation. DistilBERT is a famous example, achieving 97% of BERT's performance with 40% fewer parameters and 60% faster inference. More recently, many open-source LLMs have used distillation from larger proprietary models.

The limitation is that students can't exceed their teachers. If the teacher model has blind spots or biases, the student inherits them. And there's a floor to how small you can go before quality drops noticeably. Still, distillation remains one of the most effective techniques for deploying powerful models in resource-constrained environments.
