Knowledge distillation is a technique for transferring the knowledge held by a large model (the teacher model) to a smaller model (the student model). Hinton et al.’s original knowledge distillation paper formalized the idea of training a smaller model from the teacher model’s output distribution.[1]
What Is Knowledge Distillation?
Imagine a master chef (teacher model) who doesn’t just hand a recipe book (hard labels) to an apprentice (student model), but actually cooks alongside them, demonstrating the right heat for each ingredient and the texture to look for before moving to the next step. The chef transfers cooking intuition rather than written instructions alone.
In standard supervised learning, a model trains only on hard labels (“this is a cat,” “this is a dog”). In knowledge distillation, the student trains on the teacher’s output probability distributions, for example “cat: 85%, dog: 12%, tiger: 3%”, and so learns how the teacher spreads its uncertainty across classes. These distributions are called soft labels, and they carry richer information than one-hot hard labels (0/1 correct answers).[1]
graph TD
I["Input data"] --> T["Teacher model (large, high accuracy)"]
I --> S["Student model (small, fast)"]
T --> SL["Soft labels / intermediate features"]
SL --> DL["Distillation loss"]
S --> DL
DL --> U["Update student model weights"]
Three Types of Distillation
1. Response-based Distillation
The student model mimics the final outputs of the teacher model, training on the teacher’s soft labels.
This is the simplest form:
- The teacher outputs a probability distribution over all classes
- The student outputs a probability distribution for the same input
- The student is trained to minimize the difference between the two distributions (KL divergence)
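The three steps above can be sketched in a few lines of plain Python. This is a minimal illustration, not a training loop: the function name `kl_divergence`, the `eps` smoothing constant, and the example probabilities are all assumptions made for the sketch.

```python
import math

def kl_divergence(teacher_probs, student_probs, eps=1e-12):
    """KL(teacher || student) for two distributions over the same classes.

    eps is a hypothetical smoothing constant that guards against log(0).
    """
    return sum(
        p * (math.log(p + eps) - math.log(q + eps))
        for p, q in zip(teacher_probs, student_probs)
        if p > 0.0
    )

# Illustrative three-class example (cat, dog, tiger); the numbers are made up.
teacher = [0.85, 0.12, 0.03]
student_before = [0.40, 0.40, 0.20]   # early in training: far from the teacher
student_after = [0.80, 0.15, 0.05]    # later in training: close to the teacher

loss_before = kl_divergence(teacher, student_before)
loss_after = kl_divergence(teacher, student_after)
print(f"before: {loss_before:.4f}, after: {loss_after:.4f}")
```

The loss shrinks as the student's distribution approaches the teacher's and reaches zero when the two match. In a real setup the same quantity would be computed by a deep learning framework so that gradients can flow back into the student.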
2. Feature-based Distillation
The student model mimics the teacher’s intermediate representations (features from internal layers), going beyond the final output to transfer how the model internally constructs concepts.
This is harder to implement but allows the student to learn from the teacher’s “reasoning process” at a deeper level. DistilBERT, for example, uses a triple loss combining language modeling, distillation, and cosine-distance losses; the cosine-distance term aligns the directions of the student’s and teacher’s hidden-state vectors.[2]
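As a rough illustration of a feature-based term, the sketch below computes a mean cosine distance between teacher and student hidden vectors. It assumes both models produce vectors of the same dimensionality at each position (otherwise a learned projection would be needed); `cosine_distance`, `feature_loss`, and the toy vectors are hypothetical names and values, not DistilBERT's actual implementation.

```python
import math

def cosine_distance(u, v):
    """1 - cosine similarity; 0 when the vectors point in the same direction."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

def feature_loss(teacher_hidden, student_hidden):
    """Mean cosine distance over positions (one hidden vector per position)."""
    distances = [cosine_distance(t, s) for t, s in zip(teacher_hidden, student_hidden)]
    return sum(distances) / len(distances)

# Toy hidden states: two positions, three dimensions each (illustrative values).
teacher_hidden = [[1.0, 0.0, 2.0], [0.5, 1.5, -1.0]]
aligned_student = [[2.0, 0.0, 4.0], [1.0, 3.0, -2.0]]       # same directions, larger scale
misaligned_student = [[0.0, 1.0, 0.0], [-0.5, -1.5, 1.0]]   # orthogonal / opposite

aligned_loss = feature_loss(teacher_hidden, aligned_student)
misaligned_loss = feature_loss(teacher_hidden, misaligned_student)
```

Because cosine distance ignores vector length, the student is only pushed to match the direction of the teacher's representations, not their scale.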
3. Relation-based Distillation
The student learns relationship patterns between data points (for example, that samples A and B are similar while C and D are dissimilar), transferring structural knowledge rather than individual outputs or features.
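One simple way to express this idea, offered here as an assumption rather than a specific published method, is to compare pairwise similarity matrices: compute the cosine similarity between every pair of samples in the teacher's embedding space and in the student's, then penalize the difference. The names `similarity_matrix` and `relation_loss` and the toy embeddings are hypothetical.

```python
import math

def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def similarity_matrix(embeddings):
    """Pairwise cosine similarities between all samples in a batch."""
    return [[cosine_similarity(u, v) for v in embeddings] for u in embeddings]

def relation_loss(teacher_embeddings, student_embeddings):
    """Mean squared difference between teacher and student similarity matrices."""
    t = similarity_matrix(teacher_embeddings)
    s = similarity_matrix(student_embeddings)
    n = len(t)
    return sum((t[i][j] - s[i][j]) ** 2 for i in range(n) for j in range(n)) / (n * n)

# Samples A, B, C, D in the teacher's space: A and B are close, C and D are not.
teacher = [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
# A student that uses different axes and scales but keeps the same relationships.
preserving_student = [[0.0, 2.0, 0.0], [0.0, 1.8, 0.2], [0.0, 0.0, 3.0], [5.0, 0.0, 0.0]]
# A student whose geometry disagrees with the teacher's.
scrambled_student = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [1.0, 0.1, 0.0], [0.9, 0.2, 0.0]]

preserving_loss = relation_loss(teacher, preserving_student)
scrambled_loss = relation_loss(teacher, scrambled_student)
```

The loss depends only on how samples relate to each other, so the student is free to organize its own embedding space as long as the similarity structure is kept.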
Soft Labels and Temperature Scaling
Standard softmax outputs can become overconfident, concentrating probability on a single class (e.g. cat: 99.9%, dog: 0.1%). In this case, soft labels carry little extra information over hard labels.
Temperature scaling introduces a temperature parameter $T$ into the softmax calculation, smoothing out the probability distribution:
$p_i = \frac{\exp(z_i / T)}{\sum_j \exp(z_j / T)}$
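Setting $T = 1$ recovers the standard softmax, while setting $T > 1$ flattens the distribution, making inter-class similarities more visible. Distillation setups commonly use values around $T = 2 \text{–} 4$, though the best value depends on the task.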
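The formula above translates directly into code. The sketch below implements the temperature-scaled softmax and, as one possible way to use it, a combined loss that mixes a soft-label KL term with ordinary cross-entropy on the hard label. The weighting `alpha`, the function names, and the example logits are assumptions for illustration; the $T^2$ factor on the soft term follows the scaling suggested by Hinton et al.[1]

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """p_i = exp(z_i / T) / sum_j exp(z_j / T), shifted by the max for stability."""
    scaled = [z / temperature for z in logits]
    shift = max(scaled)
    exps = [math.exp(s - shift) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(teacher_logits, student_logits, hard_label,
                      temperature=2.0, alpha=0.5):
    """alpha * T^2 * KL(teacher_T || student_T) + (1 - alpha) * cross-entropy.

    alpha is a hypothetical mixing weight between the soft and hard terms.
    """
    p_teacher = softmax_with_temperature(teacher_logits, temperature)
    p_student = softmax_with_temperature(student_logits, temperature)
    soft_term = sum(p * (math.log(p) - math.log(q))
                    for p, q in zip(p_teacher, p_student))
    hard_probs = softmax_with_temperature(student_logits, 1.0)
    hard_term = -math.log(hard_probs[hard_label])
    return alpha * temperature ** 2 * soft_term + (1.0 - alpha) * hard_term

logits = [8.0, 3.0, 1.0]                       # illustrative teacher logits
sharp = softmax_with_temperature(logits, 1.0)  # nearly one-hot
soft = softmax_with_temperature(logits, 4.0)   # flatter, same class ranking

good_loss = distillation_loss(logits, [7.5, 3.2, 0.8], hard_label=0)
poor_loss = distillation_loss(logits, [1.0, 3.0, 8.0], hard_label=0)
```

Raising the temperature lowers the top probability and raises the others without changing their order, which is what lets the student see which wrong classes the teacher considers plausible.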
Practical Distillation: Data Distillation
In the LLM era, data distillation (also called synthetic data distillation) is widely used as well: a teacher model generates training data for the student. Microsoft’s Phi-2 post describes training with synthetic datasets and filtered web data.[3]
graph LR
D["Internal documents / references"] --> T["Teacher model (GPT-4 etc.)"]
T --> SD["Synthetic Q&A dataset"]
SD --> S["Fine-tuning the student model"]
S --> SM["Lightweight, fast specialized model"]
Example workflow:
- Use a large teacher model (e.g. GPT-4) to generate Q&A pairs from internal documents
- Build a dataset from the generated output (thousands to tens of thousands of examples)
- Fine-tune a smaller open-source model (Llama, Mistral, etc.) on that data
- Evaluate whether the student reaches performance close to the teacher on the target domain at a fraction of the cost
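The first two steps of this workflow can be sketched as a small data-preparation script. Everything here is hypothetical: `stub_teacher` stands in for a real call to a teacher model, the `prompt`/`completion` field names are just one common convention, and the length and duplicate filters are placeholder quality checks. No real API is called.

```python
import json

def stub_teacher(document):
    """Hypothetical stand-in for a teacher-model call; returns (question, answer) pairs."""
    first_sentence = document.split(".")[0].strip()
    return [
        (f"What does this passage state: {first_sentence}?", first_sentence + "."),
        ("Empty example", ""),  # simulates a low-quality generation to be filtered out
    ]

def build_synthetic_dataset(documents, generate_qa, min_answer_chars=10):
    """Collect teacher-generated Q&A pairs, dropping short answers and duplicates."""
    seen_questions = set()
    records = []
    for document in documents:
        for question, answer in generate_qa(document):
            question, answer = question.strip(), answer.strip()
            if len(answer) < min_answer_chars or question in seen_questions:
                continue
            seen_questions.add(question)
            records.append({"prompt": question, "completion": answer})
    return records

def to_jsonl(records):
    """Serialize records as JSON Lines, a format many fine-tuning tools accept."""
    return "\n".join(json.dumps(record, ensure_ascii=False) for record in records)

documents = [
    "Refunds are processed within 14 days. Contact support for exceptions.",
    "Refunds are processed within 14 days. Contact support for exceptions.",  # duplicate
    "VPN access requires a hardware token. Tokens are issued by IT.",
]
dataset = build_synthetic_dataset(documents, stub_teacher)
jsonl = to_jsonl(dataset)
```

In practice the filtering step tends to matter as much as generation, since the student will faithfully learn whatever errors remain in the synthetic data.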
Real-World Examples
| Model | Teacher | Student | Result |
|---|---|---|---|
| DistilBERT | BERT-base (110M parameters) | 6-layer Transformer (66M parameters) | 40% smaller than BERT, 60% faster, retains 97% of performance[2] |
| Phi-1 / Phi-2 | Synthetic and curated data | 1.3B–2.7B parameters | Small-model example with strong reasoning and coding performance[3] |
| Llama-based derivatives | OpenAI models (e.g. text-davinci-003 for Alpaca) | 7B–13B parameters | Acquired instruction-following and conversational capability |
Distillation Method Comparison
| Approach | Implementation complexity | Quality tendency | Primary use |
|---|---|---|---|
| Response-based (soft labels) | Low | Quality is easier to preserve on well-defined tasks | Classification, general tasks |
| Feature-based | Medium–high | Can align intermediate representations | Quality-critical fine-tuning |
| Relation-based | High | Can preserve relationships between data points | Metric learning, embeddings |
| Data distillation (synthetic) | Low–medium | Domain-dependent | LLM specialization, local deployment |
When to Use Knowledge Distillation
Knowledge distillation is well-suited for:
- Fast inference required: Real-time applications and latency-sensitive APIs
- Edge deployment: Running models on smartphones, IoT devices, or in offline environments
- Cost reduction: Reducing API costs for running large models in production
- Privacy preservation: Processing data locally without relying on cloud APIs
Comparison with Fine-tuning
| Aspect | Fine-tuning | Knowledge distillation |
|---|---|---|
| Goal | Adapt model behavior or domain | Make the model smaller and faster |
| Output model size | Same as base model | Smaller than base model |
| Implementation difficulty | Medium | Medium–high (especially feature-based) |
| Resource requirements | Medium (low with PEFT) | Medium (includes teacher inference cost) |
| Best for | Task- or domain-specific adaptation | Deployment cost and speed optimization |
Fine-tuning and knowledge distillation are not mutually exclusive. Generating synthetic data with a teacher model and then fine-tuning a student model on that data is a common approach that combines both techniques. PEFT methods can also be useful when adapting or managing smaller distilled models.[4]
Q: Is knowledge distillation the same as fine-tuning?
A: No. Fine-tuning adapts a model’s behavior to a specific task or domain. Knowledge distillation transfers knowledge from a large teacher model to a smaller student model, primarily to reduce size and increase speed. That said, data distillation (synthetic data generation) is typically carried out by fine-tuning the student on teacher-generated data, so it combines both.
Q: How much smaller can I go?
A: It depends on the task and use case. DistilBERT reported a 40% size reduction, 60% faster inference, and 97% performance retention compared with BERT.[2] For LLM data distillation, results depend heavily on the target domain and evaluation set, and reduced generality is part of the tradeoff.
Q: What is the quality loss?
A: It varies significantly by task and distillation method. For single-domain specialized tasks, quality loss can be nearly imperceptible. For broad reasoning tasks requiring general capability, higher compression ratios typically lead to larger quality drops. Measuring on a task-specific evaluation set is the only reliable way to know.
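A minimal way to put that measurement into practice is to run both models on the same task-specific evaluation set and report how much of the teacher's score the student retains. The labels and predictions below are invented for illustration, and `accuracy` is a hypothetical helper; real evaluations would use a metric suited to the task.

```python
def accuracy(predictions, labels):
    """Fraction of predictions that match the reference labels."""
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")
    return sum(p == y for p, y in zip(predictions, labels)) / len(labels)

# Hypothetical evaluation set with ten examples.
labels        = ["cat", "dog", "cat", "tiger", "dog", "cat", "dog", "tiger", "cat", "dog"]
teacher_preds = ["cat", "dog", "cat", "tiger", "dog", "cat", "dog", "cat",   "cat", "dog"]
student_preds = ["cat", "dog", "cat", "cat",   "dog", "cat", "dog", "cat",   "cat", "dog"]

teacher_acc = accuracy(teacher_preds, labels)
student_acc = accuracy(student_preds, labels)
retention = student_acc / teacher_acc  # share of teacher performance the student keeps
print(f"teacher={teacher_acc:.2f} student={student_acc:.2f} retention={retention:.1%}")
```

Tracking retention per task, rather than a single aggregate number, makes it easier to see where a distilled model gives up quality.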
References
1. Distilling the Knowledge in a Neural Network (Hinton et al.’s original knowledge distillation paper, 2015)
2. DistilBERT, a distilled version of BERT (DistilBERT paper, Sanh et al., 2019)
3. Phi-2: The surprising power of small language models (Microsoft’s data distillation case study)
4. Hugging Face PEFT Documentation (PEFT library, also useful for managing distilled models)