Knowledge Distillation


Basics covered in Fine-tuning Introduction

Knowledge distillation is a technique for transferring the knowledge held by a large model (the teacher model) to a smaller model (the student model). Hinton et al.’s original knowledge distillation paper formalized the idea of training a smaller model from the teacher model’s output distribution.[1]

What Is Knowledge Distillation?

Imagine a master chef (the teacher model) who does not just hand an apprentice (the student model) a recipe book (hard labels), but cooks alongside them, demonstrating the right heat for each ingredient and the texture to look for before moving to the next step. What gets transferred is cooking intuition, not just written instructions.

In standard supervised learning, a model trains only on hard labels (“this is a cat,” “this is a dog”). In knowledge distillation, the student trains on the teacher’s output probability distribution, for example “cat: 85%, dog: 12%, tiger: 3%”, so it learns how the teacher spreads its confidence across classes. These distributions are called soft labels, and they carry richer information than hard labels (0/1 correct answers): in the example above, they also say that this cat looks more like a dog than like a tiger.[1]

graph TD
    I["Input data"] --> T["Teacher model (large, high accuracy)"]
    I --> S["Student model (small, fast)"]
    T --> SL["Soft labels / intermediate features"]
    SL --> DL["Distillation loss"]
    S --> DL
    DL --> U["Update student model weights"]

Three Types of Distillation

1. Response-based Distillation

The student model mimics the final outputs of the teacher model, training on the teacher’s soft labels.

This is the simplest form:

The teacher outputs a probability distribution over all classes
The student outputs a probability distribution for the same input
The student is trained to minimize the difference between the two distributions (KL divergence)
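The three steps above can be sketched in plain Python. This is a minimal illustration, not a training loop: the logits are made-up values for the classes cat, dog, and tiger, and the helper names `softmax` and `kl_divergence` are chosen here for illustration.

```python
import math

def softmax(logits):
    """Convert raw logits into a probability distribution."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over the same classes."""
    return sum(
        pi * math.log((pi + eps) / (qi + eps))
        for pi, qi in zip(p, q)
        if pi > 0
    )

# Hypothetical logits for the classes [cat, dog, tiger]
teacher_logits = [4.0, 2.0, 0.5]
student_logits = [2.5, 2.0, 1.0]

teacher_probs = softmax(teacher_logits)  # step 1: teacher distribution
student_probs = softmax(student_logits)  # step 2: student distribution

# Step 3: the quantity the student is trained to minimize
response_loss = kl_divergence(teacher_probs, student_probs)
print(f"KL(teacher || student) = {response_loss:.4f}")
```

In a real setup the loss would be computed per batch inside a deep learning framework and backpropagated through the student only; the teacher's weights stay frozen.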

2. Feature-based Distillation

The student model mimics the teacher’s intermediate representations (features from internal layers), going beyond the final output to transfer how the model internally constructs concepts.

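This is harder to implement but allows the student to learn from the teacher’s “reasoning process” at a deeper level. For example, DistilBERT trains with a triple loss combining language modeling, distillation, and cosine-distance losses; the cosine-distance term aligns the directions of the student’s and teacher’s hidden-state vectors.[2]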
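As a rough sketch of what a feature-matching term can look like, the snippet below computes a mean cosine distance between teacher and student hidden states. It assumes both models use the same hidden size (when they differ, a learned projection is usually needed, which is omitted here). The vectors and the function names are illustrative, not taken from any library.

```python
import math

def cosine_distance(u, v):
    """1 - cos(u, v); 0 when the vectors point in the same direction."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

def feature_loss(teacher_states, student_states):
    """Mean cosine distance over token positions (one vector per token)."""
    distances = [
        cosine_distance(t, s) for t, s in zip(teacher_states, student_states)
    ]
    return sum(distances) / len(distances)

# Hypothetical hidden states: 2 tokens, hidden size 3 (assumed equal in both models)
teacher_states = [[0.9, 0.1, -0.3], [0.2, 0.8, 0.5]]
student_states = [[0.7, 0.2, -0.1], [0.1, 0.9, 0.3]]

loss = feature_loss(teacher_states, student_states)
print(f"feature loss = {loss:.4f}")
```

Cosine distance only compares directions, so the student is free to use a different vector scale than the teacher.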

3. Relation-based Distillation

The student learns relationship patterns between data points — understanding that “sample A and B are similar, while C and D are dissimilar” — transferring structural knowledge rather than outputs or features.
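One simple way to express this idea in code is to compare pairwise distance matrices: if A and B are close and C and D are far apart in the teacher's embedding space, the same should hold in the student's. The sketch below is an assumption-laden simplification (squared error on mean-normalized Euclidean distances); published relation-based methods use more elaborate losses. Note that the teacher and student embeddings may have different dimensions.

```python
import math

def pairwise_distances(embeddings):
    """Euclidean distance between every pair of samples."""
    n = len(embeddings)
    return [
        [math.dist(embeddings[i], embeddings[j]) for j in range(n)]
        for i in range(n)
    ]

def normalize(matrix):
    """Divide by the mean off-diagonal distance so scales are comparable."""
    n = len(matrix)
    off_diag = [matrix[i][j] for i in range(n) for j in range(n) if i != j]
    mean = sum(off_diag) / len(off_diag)
    return [[d / mean for d in row] for row in matrix]

def relation_loss(teacher_emb, student_emb):
    """Mean squared difference between normalized distance matrices."""
    t = normalize(pairwise_distances(teacher_emb))
    s = normalize(pairwise_distances(student_emb))
    n = len(t)
    return sum((t[i][j] - s[i][j]) ** 2 for i in range(n) for j in range(n)) / (n * n)

# Hypothetical embeddings for samples A, B, C, D (teacher: 4-dim, student: 2-dim)
teacher_emb = [[1.0, 0.0, 0.0, 0.0], [0.9, 0.1, 0.0, 0.0],
               [0.0, 0.0, 1.0, 0.0], [0.0, 1.0, 0.0, 1.0]]
student_emb = [[1.0, 0.0], [0.8, 0.1], [0.0, 1.0], [-1.0, 0.5]]

loss = relation_loss(teacher_emb, student_emb)
print(f"relation loss = {loss:.4f}")
```

Because only relative distances are compared, the loss does not change if every student embedding is multiplied by the same constant.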

Soft Labels and Temperature Scaling

Standard softmax outputs can become overconfident, concentrating probability on a single class (e.g. cat: 99.9%, dog: 0.1%). In this case, soft labels carry little extra information over hard labels.

Temperature scaling introduces a temperature parameter $T$ into the softmax calculation, smoothing out the probability distribution:

$p_i = \frac{\exp(z_i / T)}{\sum_j \exp(z_j / T)}$

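Setting $T = 1$ recovers the standard softmax, while $T > 1$ flattens the distribution, making inter-class similarities more visible. Values around $T = 2 \text{–} 4$ are a common starting point for distillation; the best value depends on the task.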
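To make the effect of $T$ concrete, here is a small sketch of the temperature-scaled softmax from the formula above, plus a soft-target loss built on it. The logits are invented for illustration. The $T^2$ factor follows the recommendation in Hinton et al. for keeping the soft-target gradients on a comparable scale when this term is combined with a hard-label loss; treat the rest as one possible arrangement rather than a reference implementation.

```python
import math

def softmax_with_temperature(logits, T=1.0):
    """p_i = exp(z_i / T) / sum_j exp(z_j / T)."""
    scaled = [z / T for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def soft_target_loss(student_logits, teacher_logits, T=2.0):
    """T^2 * KL(teacher || student), both distributions softened by T."""
    p = softmax_with_temperature(teacher_logits, T)
    q = softmax_with_temperature(student_logits, T)
    # Skip zero-probability teacher entries (they contribute nothing to KL)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return (T ** 2) * kl

# Hypothetical logits for the classes [cat, dog, tiger]
logits = [8.0, 3.0, 1.0]
student_logits = [5.0, 3.5, 1.5]

# Higher T spreads probability mass away from the top class
for T in (1.0, 2.0, 4.0):
    probs = softmax_with_temperature(logits, T)
    print(f"T={T}: {[round(p, 3) for p in probs]}")

print(f"soft-target loss at T=2: {soft_target_loss(student_logits, logits, T=2.0):.4f}")
```

The class ranking is the same at every temperature; only the sharpness of the distribution changes.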

Practical Distillation: Data Distillation

In the LLM era, data distillation (also called synthetic data distillation) is widely used as well: a teacher model generates training data, and the student is fine-tuned on it. Microsoft’s Phi-2 post, for example, describes training on a mixture of synthetic datasets and filtered web data.[3]

graph LR
    D["Internal documents / references"] --> T["Teacher model (GPT-4 etc.)"]
    T --> SD["Synthetic Q&A dataset"]
    SD --> S["Fine-tuning the student model"]
    S --> SM["Lightweight, fast specialized model"]

Example workflow:

Use a large teacher model (e.g. GPT-4) to generate Q&A pairs from internal documents
Build a dataset from the generated output (thousands to tens of thousands of examples)
Fine-tune a smaller open-source model (Llama, Mistral, etc.) on that data
Achieve performance close to the teacher on the target domain at a fraction of the cost
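The first two steps of this workflow can be sketched as follows. Everything here is hypothetical: `fake_teacher` is a stand-in for a real call to a large model, the `prompt`/`completion` field names are an assumed dataset schema, and the length and duplicate filters are just examples of the kind of cleaning usually applied before fine-tuning.

```python
import json

def fake_teacher(passage):
    """Stand-in for a teacher-model call (hypothetical); returns one Q&A pair."""
    first_sentence = passage.split(".")[0].strip()
    return {
        "question": f"What does the document say about: {first_sentence[:40]}?",
        "answer": passage.strip(),
    }

def build_dataset(passages, teacher, min_answer_chars=20):
    """Generate Q&A pairs, dropping very short answers and duplicate questions."""
    records = []
    seen_questions = set()
    for passage in passages:
        pair = teacher(passage)
        question = pair["question"].strip()
        answer = pair["answer"].strip()
        if len(answer) < min_answer_chars or question in seen_questions:
            continue
        seen_questions.add(question)
        records.append({"prompt": question, "completion": answer})
    return records

def to_jsonl(records):
    """Serialize records as JSON Lines, a common fine-tuning input format."""
    return "\n".join(json.dumps(r, ensure_ascii=False) for r in records)

passages = [
    "Refunds are processed within 14 days. Contact support to start a claim.",
    "Refunds are processed within 14 days. Contact support to start a claim.",  # duplicate
    "Too short.",
]

dataset = build_dataset(passages, fake_teacher)
jsonl_text = to_jsonl(dataset)
print(jsonl_text)
```

Passing the teacher in as a function keeps the pipeline testable offline; swapping in a real model client only requires a callable with the same return shape.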

Real-World Examples

Model	Teacher	Student	Result
DistilBERT	BERT-base (110M parameters)	6-layer Transformer (67M parameters)	40% smaller than BERT, 1.6× faster, retains 97% of performance[2]
Phi-1 / Phi-2	Synthetic and curated data	1.3B–2.7B parameters	Small-model example with strong reasoning and coding performance[3]
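Llama-based derivatives (Alpaca, Vicuna, etc.)	OpenAI models (text-davinci-003 outputs for Alpaca, ChatGPT conversations for Vicuna)	7B–13B parameters	Acquired instruction-following and conversational capability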

Distillation Method Comparison

Approach	Implementation complexity	Quality tendency	Primary use
Response-based (soft labels)	Low	Easier to preserve on clear tasks	Classification, general tasks
Feature-based	Medium–high	Can align intermediate representations	Quality-critical fine-tuning
Relation-based	High	Can preserve relationships between data points	Metric learning, embeddings
Data distillation (synthetic)	Low–medium	Domain-dependent	LLM specialization, local deployment

When to Use Knowledge Distillation

Knowledge distillation is well-suited for:

Fast inference: Real-time applications and latency-sensitive APIs
Edge deployment: Running models on smartphones, IoT devices, or in offline environments
Cost reduction: Reducing API costs for running large models in production
Privacy preservation: Processing data locally without relying on cloud APIs

Comparison with Fine-tuning

Aspect	Fine-tuning	Knowledge distillation
Goal	Adapt model behavior or domain	Make the model smaller and faster
Output model size	Same as base model	Smaller than the teacher model
Implementation difficulty	Medium	Medium–high (especially feature-based)
Resource requirements	Medium (low with PEFT)	Medium (includes teacher inference cost)
Best for	Task- or domain-specific adaptation	Deployment cost and speed optimization

Fine-tuning and knowledge distillation are not mutually exclusive. Generating synthetic data with a teacher model and then fine-tuning a student model on that data is a common approach that combines both techniques. PEFT methods can also be useful when adapting or managing smaller distilled models.[4]

FAQ

Q: Is knowledge distillation the same as fine-tuning?

A: No. Fine-tuning adapts a model’s behavior to a specific task or domain. Knowledge distillation transfers knowledge from a large teacher model to a smaller student model, primarily to reduce size and increase speed. That said, data distillation (synthetic data generation) is usually carried out by fine-tuning the student on teacher-generated data, so it combines both.

Q: How much smaller can I go?

A: It depends on the task and use case. DistilBERT reported a 40% size reduction, 60% faster inference, and 97% performance retention compared with BERT.[2] For LLM data distillation, results depend heavily on the target domain and evaluation set, and reduced generality is part of the tradeoff.

Q: What is the quality loss?

A: It varies significantly by task and distillation method. For single-domain specialized tasks, quality loss can be nearly imperceptible. For broad reasoning tasks requiring general capability, higher compression ratios typically lead to larger quality drops. Measuring on a task-specific evaluation set is the only reliable way to know.
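A minimal way to put a number on quality loss is to score both models on the same task-specific evaluation set and report how much of the teacher's score the student retains. The labels and predictions below are invented for illustration, and accuracy is only one possible metric.

```python
def accuracy(predictions, labels):
    """Fraction of predictions that match the reference labels."""
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)

def retention(student_score, teacher_score):
    """Share of the teacher's score that the student keeps."""
    return student_score / teacher_score

# Hypothetical evaluation set of 10 examples
labels        = ["cat", "dog", "cat", "tiger", "dog", "cat", "dog", "tiger", "cat", "dog"]
teacher_preds = ["cat", "dog", "cat", "tiger", "dog", "cat", "dog", "cat",   "cat", "dog"]
student_preds = ["cat", "dog", "cat", "cat",   "dog", "cat", "dog", "cat",   "cat", "dog"]

teacher_acc = accuracy(teacher_preds, labels)
student_acc = accuracy(student_preds, labels)

print(f"teacher accuracy: {teacher_acc:.2f}")
print(f"student accuracy: {student_acc:.2f}")
print(f"retention: {retention(student_acc, teacher_acc):.1%}")
```

With a set this small the numbers are noisy; a real evaluation set should be large enough that a one-example difference does not swing the result.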

References

[1] Distilling the Knowledge in a Neural Network. Hinton et al.’s original knowledge distillation paper (2015).
[2] DistilBERT, a distilled version of BERT. DistilBERT paper (Sanh et al., 2019).
[3] Phi-2: The surprising power of small language models. Microsoft’s data distillation case study.
[4] Hugging Face PEFT Documentation. PEFT library (also useful for managing distilled models).
