Learning Paradigms

Learning paradigms are the principles and design philosophies that govern how a machine learning model acquires and applies knowledge. This page focuses on the family of techniques that efficiently learn new tasks by reusing existing knowledge.

Training a model from scratch requires large amounts of data and significant compute. However, when relevant knowledge already exists, reusing it can achieve high performance with far less data and cost.

Analogy: Imagine someone who speaks English learning French. The strategies for language learning — how to memorize vocabulary, how to approach grammar, how to practice pronunciation — are already in place. Only the parts specific to French need focused study. This is the essence of learning paradigms.

```mermaid
graph LR
    subgraph TL["Transfer Learning"]
        BM["Base Model\n(weights frozen)"] --> NL["New Layer\n(trained on new data)"]
    end
    subgraph FT["Fine-tuning"]
        PTM["Pre-trained\nModel (full)"] --> ND["Adjusted with\nnew data"]
    end
```

Transfer Learning

Transfer learning is a technique that applies knowledge from a model trained on one task to a different but related task.

  1. Prepare a base model: Obtain a model already trained on large data (e.g., an ImageNet classification model)
  2. Freeze the weights: Lock the parameters (weights) of the base model so they are not updated
  3. Add and train new layers: Train only the new task-specific final layers on the new dataset (using backpropagation)

```mermaid
graph TD
    D1["Large training dataset\n(e.g., ImageNet: 1M+ images)"] --> BM["Base Model\n(pre-trained)"]
    BM --> |weights frozen| FM1["Feature extraction layers\n(unchanged)"]
    FM1 --> NL["New task layer\n(trained on small dataset)"]
    D2["New task data\n(e.g., medical images: 1,000)"] --> NL
```

When to use transfer learning:

  • When data for the new task is scarce (hundreds to a few thousand samples)
  • When the original task and the new task are related (e.g., both image recognition, or both text processing)

Example: Use an image classification model trained on ImageNet (1M+ general images) to build a diagnostic model for medical X-ray images (1,000 images). General visual features such as edge detection and shape recognition are reused; only the judgment criteria specific to medical images are learned from scratch.
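The three steps above can be sketched end to end in a few lines of plain Python. This is a toy illustration, not a real vision model: the frozen "base" is a hand-picked linear map standing in for a pre-trained network, and the dataset, shapes, and learning rate are all invented for the example.

```python
# Step 1-2: a frozen pre-trained feature extractor. In practice this would
# be a real network (e.g. trained on ImageNet); here it is a fixed linear
# map whose weights are never updated.
BASE_W = [[0.5, -0.2], [0.1, 0.9]]

def base_features(x):
    """Feature extraction with frozen weights."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in BASE_W]

# Step 3: a new task-specific layer -- the ONLY trainable parameters.
head_w = [0.0, 0.0]
head_b = 0.0

def predict(x):
    f = base_features(x)
    return sum(w * fi for w, fi in zip(head_w, f)) + head_b

# Tiny "new task" dataset (stand-in for e.g. 1,000 medical images).
data = [([1.0, 0.0], 0.5), ([0.0, 1.0], -0.2), ([1.0, 1.0], 0.3)]

lr = 0.2
for _ in range(2000):
    for x, y in data:
        err = predict(x) - y
        f = base_features(x)
        # Gradient step on the new head only; BASE_W stays frozen.
        for i in range(len(head_w)):
            head_w[i] -= lr * err * f[i]
        head_b -= lr * err
```

After training, only `head_w` and `head_b` have changed; the base weights are identical to before, which is exactly what "freeze the weights" means.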

Fine-tuning

Fine-tuning is a technique that re-adjusts the parameters of all layers (or a selected subset) of a pre-trained model using new data.

| | Transfer Learning | Fine-tuning |
| --- | --- | --- |
| Weights updated | New layers only | All layers (or including lower layers) |
| Data requirement | Works with small amounts | More data is preferable |
| Compute cost | Low | Higher |
| Overfitting risk | Low | High if data is scarce |

The entire pre-trained model is trained further on new data. A common effective strategy is discriminative learning rates: lower learning rates for lower layers (which learned general features) and higher learning rates for upper layers (which learned task-specific features).
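A minimal sketch of discriminative learning rates, assuming a toy model where each "layer" is a single parameter; the layer names and rate values are illustrative, not taken from any particular library.

```python
# Hypothetical two-layer model with per-layer learning rates.
# Lower layers (general features) get a small rate so they change slowly;
# upper layers (task-specific features) get a larger rate.
params = {
    "lower_layer": {"w": 1.0, "lr": 1e-4},  # general features
    "upper_layer": {"w": 1.0, "lr": 1e-2},  # task-specific features
}

def sgd_step(grads):
    """One update step: each layer uses its own (discriminative) rate."""
    for name, g in grads.items():
        p = params[name]
        p["w"] -= p["lr"] * g

# With identical gradients, the upper layer moves 100x further per step.
sgd_step({"lower_layer": 1.0, "upper_layer": 1.0})
```

Real frameworks express the same idea through optimizer parameter groups, each with its own learning rate.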

Example: Fine-tune a general-purpose LLM such as GPT-4 on a company’s customer support conversation data to build a dialogue model specialized for that company’s products and services.

Multitask Learning

Multitask learning is a model design approach in which a single model learns multiple different tasks simultaneously.

```mermaid
graph TD
    IN["Input Data"] --> SH["Shared Network"]
    SH --> T1["Task 1 Branch\n(e.g., sentiment analysis)"]
    SH --> T2["Task 2 Branch\n(e.g., topic classification)"]
    SH --> T3["Task 3 Branch\n(e.g., language detection)"]
    T1 --> O1["Output 1"]
    T2 --> O2["Output 2"]
    T3 --> O3["Output 3"]
```

Structure:

  • Shared network: Learns general representations common across all tasks
  • Task-specific branches: Generate the output for each individual task

Benefits:

  • Knowledge learned from each task complements the others, improving generalization beyond what separate training achieves
  • Reduces the cost of training and managing separate models for each task
  • Tasks with limited data can benefit from the learning signal provided by other tasks

Example: In natural language processing, training sentiment analysis, topic classification, and language detection together in a single model allows the general language understanding learned for each task to be shared across all tasks.
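The shared-trunk-plus-branches shape in the diagram can be expressed as a tiny forward pass. All weights and task names here are illustrative placeholders, and training is omitted:

```python
# One shared representation feeds several task-specific heads.
SHARED_W = [[0.4, 0.6], [0.7, -0.3]]  # shared network (trunk)
HEADS = {
    "sentiment": [1.0, -1.0],   # task 1 branch
    "topic":     [0.5,  0.5],   # task 2 branch
    "language":  [-0.2, 0.8],   # task 3 branch
}

def shared(x):
    """Shared representation learned jointly from all tasks."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in SHARED_W]

def forward(x):
    h = shared(x)  # computed once, reused by every task branch
    return {task: sum(w * hi for w, hi in zip(head, h))
            for task, head in HEADS.items()}

outputs = forward([1.0, 2.0])  # one input, three task outputs
```

Because every task's loss backpropagates through `SHARED_W`, the trunk receives a learning signal from all tasks at once; that is where the cross-task benefit comes from.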

Federated Learning

Federated learning is a technique for training a model on data distributed across many devices without ever centralizing that data on a single server.

```mermaid
graph TD
    C["Central Server\n(distributes and aggregates model weights)"]
    D1["Device 1\n(smartphone)\nLocal training"]
    D2["Device 2\n(smartphone)\nLocal training"]
    D3["Device 3\n(smartphone)\nLocal training"]
    C --> |distribute shared model| D1
    C --> |distribute shared model| D2
    C --> |distribute shared model| D3
    D1 --> |send weight updates only| C
    D2 --> |send weight updates only| C
    D3 --> |send weight updates only| C
```

  1. The central server distributes the shared model to each device
  2. Each device trains locally on its own data (data never leaves the device)
  3. Only the weight updates (gradients) from each device are sent to the server
  4. The server aggregates the updates and improves the shared model
  5. Steps 1–4 repeat
Benefits:

  • Privacy protection: Personal data does not leave the device, reducing privacy risk
  • Reduced communication cost: Weight updates rather than raw data are transmitted, keeping bandwidth low
  • Personalization: Each device can be optimized based on its own local data

Example: The keyboard prediction feature on a smartphone. Each user’s input history (personal data) stays on the device. Only the learning results (weight updates) are aggregated to improve the shared model.
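Steps 1 through 5 can be sketched as a miniature FedAvg-style loop, assuming each "device" fits a one-parameter model y = w * x to its own private points; the data and hyperparameters are invented for illustration.

```python
def local_update(weights, local_data, lr=0.1):
    """Steps 2-3: local training. Each device improves the model on its
    own data (which never leaves the device) and returns only weights."""
    w = weights
    for x, y in local_data:
        w -= lr * (w * x - y) * x  # plain SGD on the local squared error
    return w

def federated_round(global_w, device_datasets):
    # Step 1: server distributes global_w; steps 2-3 run per device.
    local_ws = [local_update(global_w, data) for data in device_datasets]
    # Step 4: server aggregates by averaging the returned weights.
    return sum(local_ws) / len(local_ws)

# Three "devices", each holding private data consistent with w = 2.
devices = [[(1.0, 2.0)], [(2.0, 4.0)], [(0.5, 1.0)]]
w = 0.0
for _ in range(50):  # step 5: repeat rounds
    w = federated_round(w, devices)
```

Only `local_ws` crosses the network in this sketch; the `(x, y)` pairs stay inside `local_update`, which mirrors the privacy property described above.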

| | Transfer Learning | Fine-tuning | Multitask Learning | Federated Learning |
| --- | --- | --- | --- | --- |
| Primary goal | Adapting to a new task with limited data | Domain adaptation and performance gains | Simultaneous multi-task learning | Privacy-preserving distributed learning |
| Data requirement | Small amounts acceptable | More data preferable | Data for each task needed | Distributed data |
| Compute cost | Low | Medium–High | Medium | Low per device (distributed) |
| Typical use case | Repurposing an image classifier | LLM domain adaptation | Multi-purpose NLP models | Smartphones, IoT |

Q: Are transfer learning and fine-tuning separate techniques?

A: They are related but distinct. Transfer learning is the broader concept of applying knowledge from a trained model to a new task. Fine-tuning is one way to achieve this — by re-adjusting all model parameters on new data. Some texts distinguish fine-tuning (updating all layers) from transfer learning (updating only the final layer), while others treat fine-tuning as a subset of transfer learning.

Q: How much data does LLM fine-tuning require?

A: It depends on the task and quality requirements, but a few hundred to a few thousand high-quality examples can be enough for task-specific adaptation. Techniques like LoRA (Low-Rank Adaptation) enable effective fine-tuning by updating only a small fraction of parameters.
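The low-rank idea behind LoRA can be shown at toy scale: keep the pre-trained matrix W frozen and learn only a rank-r correction B @ A. The dimensions and values below are illustrative, and real implementations also scale the update and train B and A by backpropagation.

```python
def matmul(X, Y):
    """Plain-Python matrix multiply (X is m x k, Y is k x n)."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)]
            for row in X]

d, r = 4, 1                                   # full dim 4, rank-1 update
W = [[1.0 if i == j else 0.0 for j in range(d)] for i in range(d)]  # frozen
B = [[0.1], [0.0], [0.0], [0.0]]              # d x r, trainable
A = [[0.0, 0.2, 0.0, 0.0]]                    # r x d, trainable

delta = matmul(B, A)                          # low-rank update, d x d
W_eff = [[w + dw for w, dw in zip(rw, rd)] for rw, rd in zip(W, delta)]

# Trainable parameters: 2 * d * r = 8 instead of d * d = 16; at realistic
# sizes the gap is dramatic (e.g. 2 * 4096 * 8 vs 4096 * 4096).
```

This is why LoRA fine-tuning is cheap: only the small B and A matrices receive gradient updates, while W stays untouched.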

Q: Does federated learning fully protect data privacy?

A: Not completely. Attacks that reconstruct original data from gradient updates (gradient inversion attacks) exist. Complete privacy protection requires combining federated learning with additional techniques such as Differential Privacy.

