Knowledge Distillation

About 10 minutes

Prerequisites: Basics covered in Fine-tuning Introduction

Knowledge distillation is a technique for transferring the knowledge held by a large model (the teacher model) to a smaller model (the student model). Hinton et al.’s original knowledge distillation paper formalized the idea of training a smaller model from the teacher model’s output distribution.[1]

Imagine a master chef (teacher model) who doesn’t just hand a recipe book (hard labels) to an apprentice (student model), but actually cooks alongside them — demonstrating the right heat for each ingredient, the texture to look for before moving to the next step — transferring cooking intuition rather than written instructions alone.

In standard supervised learning, a model trains only on hard labels (“this is a cat,” “this is a dog”). In knowledge distillation, the student trains on the teacher’s output probability distribution, for example “cat: 85%, dog: 12%, tiger: 3%”, which shows how the teacher spreads its uncertainty across classes. These teacher distributions are called soft labels, and they carry richer information than hard labels (0/1 correct answers): in the example above, the student also learns that the teacher finds the input closer to a dog than to a tiger.[1]

graph TD
    I["Input data"] --> T["Teacher model (large, high accuracy)"]
    I --> S["Student model (small, fast)"]
    T --> SL["Soft labels / intermediate features"]
    SL --> DL["Distillation loss"]
    S --> DL
    DL --> U["Update student model weights"]

In response-based distillation, the student model mimics the final outputs of the teacher model by training on the teacher’s soft labels.

This is the simplest form:

  • The teacher outputs a probability distribution over all classes
  • The student outputs a probability distribution for the same input
  • The student is trained to minimize the difference between the two distributions (KL divergence)
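
The three steps above can be sketched in a few lines of plain Python. This is a minimal illustration, not a training loop: the logits are made-up values for a single input, and the helper names `softmax` and `kl_divergence` are chosen here for the example.

```python
import math

def softmax(logits):
    """Convert raw logits to a probability distribution."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions given as lists."""
    return sum(
        pi * math.log((pi + eps) / (qi + eps))
        for pi, qi in zip(p, q)
        if pi > 0
    )

# Hypothetical logits for one input over the classes [cat, dog, tiger]
teacher_logits = [4.0, 2.0, 0.5]
student_logits = [2.5, 2.0, 1.0]

teacher_probs = softmax(teacher_logits)
student_probs = softmax(student_logits)

# Distillation loss: how far the student's distribution is from the teacher's
loss = kl_divergence(teacher_probs, student_probs)
print(f"distillation loss (KL): {loss:.4f}")
```

In real training, this loss would be averaged over a batch and minimized with respect to the student's weights only, while the teacher stays frozen.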

In feature-based distillation, the student model mimics the teacher’s intermediate representations (features from internal layers). This goes beyond the final output and transfers how the model internally constructs concepts.

This is harder to implement but lets the student learn from the teacher’s “reasoning process” at a deeper level. DistilBERT, for example, uses a triple loss that combines language modeling, distillation, and a cosine-distance loss that aligns student and teacher hidden states.[2]
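
As a rough illustration of a feature-matching term, the sketch below computes the mean cosine distance between teacher and student hidden vectors. It assumes both models produce hidden vectors of the same width; when the widths differ, a learned projection is typically needed first. The vectors and the function names are invented for the example and are not DistilBERT's actual implementation.

```python
import math

def cosine_distance(u, v):
    """1 - cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

def feature_loss(teacher_hidden, student_hidden):
    """Mean cosine distance over paired hidden vectors (one pair per token)."""
    dists = [cosine_distance(t, s) for t, s in zip(teacher_hidden, student_hidden)]
    return sum(dists) / len(dists)

# Hypothetical hidden states for two tokens, three dimensions each
teacher_hidden = [[0.9, 0.1, 0.3], [0.2, 0.8, 0.5]]
student_hidden = [[0.8, 0.2, 0.1], [0.1, 0.9, 0.4]]

loss = feature_loss(teacher_hidden, student_hidden)
print(f"feature loss (mean cosine distance): {loss:.4f}")
```

The loss approaches zero as the student's hidden vectors point in the same direction as the teacher's, regardless of their magnitude.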

In relation-based distillation, the student learns the relationship patterns between data points, for example that “samples A and B are similar, while C and D are dissimilar”. This transfers structural knowledge rather than individual outputs or features.
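
One way to make this concrete is to compare pairwise distance matrices computed from teacher and student embeddings. The sketch below is an illustrative assumption, not a specific published loss: distances are normalized by their mean so that the two embedding spaces can have different dimensions and scales, and all embeddings are made-up values.

```python
import math

def pairwise_distances(embeddings):
    """Euclidean distance between every pair of embeddings."""
    n = len(embeddings)
    return [[math.dist(embeddings[i], embeddings[j]) for j in range(n)]
            for i in range(n)]

def normalize(matrix):
    """Divide by the mean off-diagonal distance so scales are comparable."""
    n = len(matrix)
    off_diag = [matrix[i][j] for i in range(n) for j in range(n) if i != j]
    mean = sum(off_diag) / len(off_diag)
    return [[d / mean for d in row] for row in matrix]

def relation_loss(teacher_emb, student_emb):
    """Mean squared difference between normalized distance matrices."""
    t = normalize(pairwise_distances(teacher_emb))
    s = normalize(pairwise_distances(student_emb))
    n = len(t)
    return sum((t[i][j] - s[i][j]) ** 2 for i in range(n) for j in range(n)) / (n * n)

# Hypothetical embeddings for samples A, B, C, D (teacher: 4-d, student: 2-d)
teacher_emb = [[1.0, 0.0, 0.0, 0.0], [0.9, 0.1, 0.0, 0.0],
               [0.0, 0.0, 1.0, 0.0], [0.0, 0.0, 0.0, 1.0]]
student_emb = [[1.0, 0.0], [0.8, 0.1], [-1.0, 0.5], [0.0, -1.0]]

loss = relation_loss(teacher_emb, student_emb)
print(f"relation loss: {loss:.4f}")
```

Because only the distances between samples are compared, the student is free to use a smaller embedding space than the teacher.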

Standard softmax outputs can become overconfident, concentrating probability on a single class (e.g. cat: 99.9%, dog: 0.1%). In this case, soft labels carry little extra information over hard labels.

Temperature scaling introduces a temperature parameter $T$ into the softmax calculation, smoothing out the probability distribution:

$p_i = \frac{\exp(z_i / T)}{\sum_j \exp(z_j / T)}$

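Setting $T > 1$ flattens the distribution, making inter-class similarities more visible. During distillation the same $T$ is applied to both the teacher’s and the student’s logits. Values around $T = 2 \text{–} 4$ are a common starting point, but the best value depends on the task.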
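
The effect of $T$ is easy to see numerically. The following sketch implements the formula above in plain Python and prints the distribution at several temperatures; the logits are hypothetical values chosen to produce an overconfident output at $T = 1$.

```python
import math

def softmax_with_temperature(logits, T=1.0):
    """Softmax over logits divided by temperature T (T > 0)."""
    scaled = [z / T for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits over the classes [cat, dog, tiger]
logits = [8.0, 1.5, 0.5]

for T in (1.0, 2.0, 4.0):
    probs = softmax_with_temperature(logits, T)
    print(f"T={T}: {[round(p, 3) for p in probs]}")
```

Raising $T$ moves probability mass from the top class to the others without changing their ranking. Hinton et al. also note that the gradients produced by soft targets scale as $1/T^2$, so the soft-label term is usually multiplied by $T^2$ when it is combined with a hard-label loss.[1]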

In the LLM era, data distillation (also called synthetic data distillation) is widely used as well: a teacher model generates training data, and the student is trained on it. Microsoft’s Phi-2 post describes training with synthetic datasets and filtered web data.[3]

graph LR
    D["Internal documents / references"] --> T["Teacher model (GPT-4 etc.)"]
    T --> SD["Synthetic Q&A dataset"]
    SD --> S["Fine-tuning the student model"]
    S --> SM["Lightweight, fast specialized model"]

Example workflow:

  1. Use a large teacher model (e.g. GPT-4) to generate Q&A pairs from internal documents
  2. Build a dataset from the generated output (thousands to tens of thousands of examples)
  3. Fine-tune a smaller open-source model (Llama, Mistral, etc.) on that data
  4. Achieve performance close to the teacher on the target domain at a fraction of the cost
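
Steps 1 and 2 of this workflow can be sketched as follows. Everything here is a stand-in: `teacher_generate_qa` is a hypothetical placeholder for a real call to the teacher model and returns a canned pair so the example runs offline, and the `is_usable` quality filter is an added assumption rather than part of the workflow above.

```python
import json

def teacher_generate_qa(document):
    """Hypothetical stand-in for a teacher model call.

    A real pipeline would send the document to the teacher and parse its
    reply. Here we derive one canned pair from the first sentence.
    """
    first_sentence = document.split(".")[0].strip()
    return [{"question": "What does this document state first?",
             "answer": first_sentence}]

def is_usable(pair):
    """Assumed quality filter: drop answers shorter than three words."""
    return len(pair["answer"].split()) >= 3

def build_dataset(documents):
    """Collect filtered teacher outputs as prompt/completion records."""
    dataset = []
    for doc in documents:
        for pair in teacher_generate_qa(doc):
            if is_usable(pair):
                dataset.append({"prompt": pair["question"],
                                "completion": pair["answer"]})
    return dataset

# Made-up internal documents; the second is too short to pass the filter
documents = [
    "Refunds are processed within five business days. Contact support for exceptions.",
    "OK.",
]

dataset = build_dataset(documents)
jsonl = "\n".join(json.dumps(row) for row in dataset)
print(jsonl)
```

The resulting JSONL records would then feed the fine-tuning job in step 3. The exact record schema depends on the training framework, so the `prompt` and `completion` field names are placeholders.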

| Model | Teacher | Student | Result |
| --- | --- | --- | --- |
| DistilBERT | BERT-base (110M parameters) | 6-layer Transformer (67M parameters) | 40% smaller than BERT, 60% faster, retains 97% of performance[2] |
| Phi-1 / Phi-2 | Synthetic and curated data | 1.3B–2.7B parameters | Small-model example with strong reasoning and coding performance[3] |
| Llama-based derivatives | OpenAI models (e.g. text-davinci-003 for Alpaca) | 7B–13B parameters | Acquired conversational capability |


| Approach | Implementation complexity | Quality tendency | Primary use |
| --- | --- | --- | --- |
| Response-based (soft labels) | Low | Easier to preserve on clear tasks | Classification, general tasks |
| Feature-based | Medium–high | Can align intermediate representations | Quality-critical fine-tuning |
| Relation-based | High | Can preserve relationships between data points | Metric learning, embeddings |
| Data distillation (synthetic) | Low–medium | Domain-dependent | LLM specialization, local deployment |

Knowledge distillation is well-suited for:

  • Fast inference required: Real-time applications and latency-sensitive APIs
  • Edge deployment: Running models on smartphones, IoT devices, or in offline environments
  • Cost reduction: Reducing the API or serving cost of running large models in production
  • Privacy preservation: Processing data locally without relying on cloud APIs

| Aspect | Fine-tuning | Knowledge distillation |
| --- | --- | --- |
| Goal | Adapt model behavior or domain | Make the model smaller and faster |
| Output model size | Same as base model | Smaller than base model |
| Implementation difficulty | Medium | Medium–high (especially feature-based) |
| Resource requirements | Medium (low with PEFT) | Medium (includes teacher inference cost) |
| Best for | Task- or domain-specific adaptation | Deployment cost and speed optimization |

Fine-tuning and knowledge distillation are not mutually exclusive. Generating synthetic data with a teacher model and then fine-tuning a student model on that data is a common approach that combines both techniques. PEFT methods can also be useful when adapting or managing smaller distilled models.[4]

Q: Is knowledge distillation the same as fine-tuning?

A: No. Fine-tuning adapts a model’s behavior to a specific task or domain. Knowledge distillation transfers knowledge from a large teacher model to a smaller student model, primarily to reduce size and improve speed. That said, data distillation (synthetic data generation) is usually carried out by fine-tuning the student on teacher-generated data, so it combines both.

Q: How much smaller can I go?

A: It depends on the task and use case. DistilBERT reported a 40% size reduction, 60% faster inference, and 97% performance retention compared with BERT.[2] For LLM data distillation, results depend heavily on the target domain and evaluation set, and reduced generality is part of the tradeoff.

Q: What is the quality loss?

A: It varies significantly by task and distillation method. For single-domain specialized tasks, quality loss can be nearly imperceptible. For broad reasoning tasks requiring general capability, higher compression ratios typically lead to larger quality drops. Measuring on a task-specific evaluation set is the only reliable way to know.
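
A minimal sketch of such a measurement is shown below: it scores teacher and student on the same labeled evaluation set and reports the share of teacher accuracy that the student retains. The labels and predictions are invented for illustration, and accuracy is only one possible metric.

```python
def accuracy(predictions, labels):
    """Fraction of predictions that match the reference labels."""
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)

# Hypothetical task-specific evaluation set with model predictions
labels        = ["cat", "dog", "cat", "tiger", "dog", "cat", "dog", "tiger", "cat", "dog"]
teacher_preds = ["cat", "dog", "cat", "cat",   "dog", "cat", "dog", "tiger", "cat", "dog"]
student_preds = ["cat", "dog", "cat", "cat",   "dog", "cat", "dog", "cat",   "cat", "dog"]

teacher_acc = accuracy(teacher_preds, labels)
student_acc = accuracy(student_preds, labels)

# Share of the teacher's accuracy that the student keeps
retention = student_acc / teacher_acc
print(f"teacher={teacher_acc:.2f} student={student_acc:.2f} retention={retention:.1%}")
```

Reporting retention alongside absolute accuracy makes it easier to compare students of different sizes against the same teacher.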

  1. Distilling the Knowledge in a Neural Network — Hinton et al.’s original knowledge distillation paper (2015)
  2. DistilBERT, a distilled version of BERT — DistilBERT paper (Sanh et al., 2019)
  3. Phi-2: The surprising power of small language models — Microsoft’s data distillation case study
  4. Hugging Face PEFT Documentation — PEFT library (also useful for managing distilled models)