Learning Paradigms
Learning paradigms are the principles and design philosophies that govern how a machine learning model acquires and applies knowledge. This page focuses on the family of techniques that efficiently learn new tasks by reusing existing knowledge.
Why “Reuse” Matters
Training a model from scratch requires large amounts of data and significant compute. However, when relevant knowledge already exists, reusing it can achieve high performance with far less data and cost.
Analogy: Imagine someone who speaks English learning French. The strategies for language learning — how to memorize vocabulary, how to approach grammar, how to practice pronunciation — are already in place. Only the parts specific to French need focused study. This is the essence of learning paradigms.
```mermaid
graph LR
subgraph TL["Transfer Learning"]
BM["Base Model\n(weights frozen)"] --> NL["New Layer\n(trained on new data)"]
end
subgraph FT["Fine-tuning"]
PTM["Pre-trained\nModel (full)"] --> ND["Adjusted with\nnew data"]
end
```

1. Transfer Learning
Transfer learning is a technique that applies knowledge from a model trained on one task to a different but related task.
How It Works
- Prepare a base model: Obtain a model already trained on a large dataset (e.g., an ImageNet classification model)
- Freeze the weights: Lock the parameters (weights) of the base model so they are not updated
- Add and train new layers: Train only the new task-specific final layers on the new dataset (using backpropagation)
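The three steps above can be sketched in a few lines of numpy. Everything here is a hypothetical toy stand-in: a frozen random projection plays the role of the pre-trained feature-extraction layers, and a small logistic layer is the new task head.

```python
import numpy as np

rng = np.random.default_rng(0)

X = rng.normal(size=(200, 16))          # small new-task dataset (toy)
y = (X[:, 0] > 0).astype(float)         # toy binary labels

W_base = rng.normal(size=(16, 8))       # step 1: "pre-trained" base weights
W_frozen = W_base.copy()                # step 2: frozen, never updated

features = np.maximum(X @ W_base, 0.0)  # frozen feature extraction (ReLU)

w_new = np.zeros(8)                     # step 3: new task-specific layer
b_new = 0.0

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss(p):
    return -np.mean(y * np.log(p + 1e-9) + (1 - y) * np.log(1 - p + 1e-9))

initial_loss = loss(sigmoid(features @ w_new + b_new))
for _ in range(500):                    # backprop updates the new layer only
    p = sigmoid(features @ w_new + b_new)
    grad = p - y                        # cross-entropy gradient wrt logits
    w_new -= 0.01 * features.T @ grad / len(y)
    b_new -= 0.01 * grad.mean()
final_loss = loss(sigmoid(features @ w_new + b_new))
```

After training, `final_loss` is lower than `initial_loss` while `W_base` is bit-for-bit unchanged, which is exactly the division of labor transfer learning relies on.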
```mermaid
graph TD
D1["Large training dataset\n(e.g., ImageNet: 1M+ images)"] --> BM["Base Model\n(pre-trained)"]
BM --> |weights frozen| FM1["Feature extraction layers\n(unchanged)"]
FM1 --> NL["New task layer\n(trained on small dataset)"]
D2["New task data\n(e.g., medical images: 1,000)"] --> NL
```

When to Use It
- When data for the new task is scarce (hundreds to a few thousand samples)
- When the original task and the new task are related (both image recognition, both text processing, etc.)
Concrete Example
Use an image classification model trained on ImageNet (1M+ general images) to build a diagnostic model for medical X-ray images (1,000 images). General visual features such as edge detection and shape recognition are reused; only the judgment criteria specific to medical images are learned from scratch.
2. Fine-tuning
Fine-tuning is a technique that re-adjusts the parameters of all layers (or a selected subset) of a pre-trained model using new data.
Difference From Transfer Learning
| | Transfer Learning | Fine-tuning |
|---|---|---|
| Weights updated | New layers only | All layers (or lower layers included) |
| Data requirement | Can work with small amounts | More data is preferable |
| Compute cost | Low | Higher |
| Overfitting risk | Low | High if data is scarce |
How It Works
The entire pre-trained model is trained further on new data. A common effective strategy is discriminative learning rates: lower learning rates for lower layers (which learned general features) and higher learning rates for upper layers (which learned task-specific features).
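Discriminative learning rates amount to nothing more than keeping a per-layer rate and applying it in the update step. A minimal numpy sketch (the layer names, sizes, and rate values are all hypothetical):

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical two-layer model: "lower" holds general features (small
# learning rate), "upper" holds task-specific features (larger rate).
params = {
    "lower": rng.normal(size=(4, 4)),
    "upper": rng.normal(size=(4, 2)),
}
learning_rates = {"lower": 1e-4, "upper": 1e-2}   # discriminative rates

def sgd_step(params, grads, learning_rates):
    """One fine-tuning step with a per-layer learning rate."""
    for name, grad in grads.items():
        params[name] = params[name] - learning_rates[name] * grad

before = {name: p.copy() for name, p in params.items()}
grads = {name: np.ones_like(p) for name, p in params.items()}  # dummy gradients
sgd_step(params, grads, learning_rates)

shift_lower = np.abs(params["lower"] - before["lower"]).max()  # moves by 1e-4
shift_upper = np.abs(params["upper"] - before["upper"]).max()  # moves by 1e-2
```

With unit gradients, the upper layer moves 100x farther per step than the lower layer, so task-specific features adapt quickly while general features are barely disturbed. Deep learning frameworks expose the same idea through per-layer parameter groups in their optimizers.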
Concrete Example
Fine-tune a general-purpose LLM such as GPT-4 on a company’s customer support conversation data to build a dialogue model specialized for that company’s products and services.
3. Multitask Learning
Multitask learning is a model design approach in which a single model learns multiple different tasks simultaneously.
How It Works
```mermaid
graph TD
IN["Input Data"] --> SH["Shared Network"]
SH --> T1["Task 1 Branch\n(e.g., sentiment analysis)"]
SH --> T2["Task 2 Branch\n(e.g., topic classification)"]
SH --> T3["Task 3 Branch\n(e.g., language detection)"]
T1 --> O1["Output 1"]
T2 --> O2["Output 2"]
T3 --> O3["Output 3"]
```

- Shared network: Learns general representations common across all tasks
- Task-specific branches: Generate the output for each individual task
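The shared-trunk-plus-branches structure can be made concrete with a forward pass in numpy. The task names, dimensions, and weights below are hypothetical placeholders:

```python
import numpy as np

rng = np.random.default_rng(2)

W_shared = rng.normal(size=(16, 8))            # shared network
heads = {
    "sentiment": rng.normal(size=(8, 2)),      # task 1 branch: 2 classes
    "topic": rng.normal(size=(8, 5)),          # task 2 branch: 5 classes
    "language": rng.normal(size=(8, 3)),       # task 3 branch: 3 classes
}

def forward(x):
    h = np.maximum(x @ W_shared, 0.0)          # shared representation
    return {task: h @ W for task, W in heads.items()}

outputs = forward(rng.normal(size=(4, 16)))    # batch of 4 inputs
```

Training would minimize a combined objective (e.g., the sum of per-task losses), so gradients from every task flow back into `W_shared`; that shared gradient signal is what lets data-poor tasks benefit from data-rich ones.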
Advantages
- Knowledge learned from each task complements the others, improving generalization beyond what individual training achieves
- Reduces the cost of training and managing separate models for each task
- Tasks with limited data can benefit from the learning signal provided by other tasks
Concrete Example
In natural language processing, training sentiment analysis, topic classification, and language detection together in a single model allows the general language understanding learned for each task to be shared across all tasks.
4. Federated Learning
Federated learning is a technique for training a model on data distributed across many devices without ever centralizing that data on a single server.
How It Works
```mermaid
graph TD
C["Central Server\n(distributes and aggregates model weights)"]
D1["Device 1\n(smartphone)\nLocal training"]
D2["Device 2\n(smartphone)\nLocal training"]
D3["Device 3\n(smartphone)\nLocal training"]
C --> |distribute shared model| D1
C --> |distribute shared model| D2
C --> |distribute shared model| D3
D1 --> |send weight updates only| C
D2 --> |send weight updates only| C
D3 --> |send weight updates only| C
```

1. The central server distributes the shared model to each device
2. Each device trains locally on its own data (data never leaves the device)
3. Only the weight updates (gradients) from each device are sent to the server
4. The server aggregates the updates and improves the shared model
5. Steps 1–4 repeat
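The distribute-train-aggregate loop can be sketched in FedAvg style with numpy. The devices, their data, and the linear model are all hypothetical stand-ins; real systems add sampling, weighting by dataset size, and secure aggregation:

```python
import numpy as np

rng = np.random.default_rng(3)

def local_update(w_global, X, y, lr=0.1, steps=10):
    """Train locally on one device's data; only the delta leaves the device."""
    w = w_global.copy()
    for _ in range(steps):
        grad = X.T @ (X @ w - y) / len(y)      # squared-error gradient
        w -= lr * grad
    return w - w_global                        # weight update, not raw data

w_global = np.zeros(4)                         # shared model on the server
devices = [(rng.normal(size=(20, 4)), rng.normal(size=20)) for _ in range(3)]

for _ in range(5):                             # rounds: distribute, train, aggregate
    updates = [local_update(w_global, X, y) for X, y in devices]
    w_global = w_global + np.mean(updates, axis=0)   # server averages the deltas
```

Note what crosses the network: `w_global` goes out, each device's `return w - w_global` comes back. The arrays `X` and `y` never leave `local_update`.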
Advantages
- Privacy protection: Personal data does not leave the device, reducing privacy risk
- Reduced communication cost: Weight updates rather than raw data are transmitted, keeping bandwidth low
- Personalization: Each device can be optimized based on its own local data
Concrete Example
The keyboard prediction feature on a smartphone. Each user’s input history (personal data) stays on the device. Only the learning results (weight updates) are aggregated to improve the shared model.
Comparison of the Four Techniques
| | Transfer Learning | Fine-tuning | Multitask Learning | Federated Learning |
|---|---|---|---|---|
| Primary goal | Adapting to a new task with limited data | Domain adaptation and performance gains | Simultaneous multi-task learning | Privacy-preserving distributed learning |
| Data requirement | Small amounts acceptable | More data preferable | Data for each task needed | Distributed data |
| Compute cost | Low | Medium–High | Medium | Low per device (distributed) |
| Typical use case | Repurposing an image classifier | LLM domain adaptation | Multi-purpose NLP models | Smartphones, IoT |
Q: Are transfer learning and fine-tuning separate techniques?
A: They are related but distinct. Transfer learning is the broader concept of applying knowledge from a trained model to a new task. Fine-tuning is one way to achieve this — by re-adjusting all model parameters on new data. Some texts distinguish fine-tuning (updating all layers) from transfer learning (updating only the final layer), while others treat fine-tuning as a subset of transfer learning.
Q: How much data does LLM fine-tuning require?
A: It depends on the task and quality requirements, but a few hundred to a few thousand high-quality examples can be enough for task-specific adaptation. Techniques like LoRA (Low-Rank Adaptation) enable effective fine-tuning by updating only a small fraction of parameters.
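The core idea of LoRA fits in a few lines: keep the pre-trained weight frozen and learn a low-rank correction. A minimal numpy sketch (the sizes and rank are hypothetical; real LoRA is applied inside transformer attention/projection layers):

```python
import numpy as np

rng = np.random.default_rng(4)

d, r = 64, 4                         # hidden size 64, adapter rank 4
W0 = rng.normal(size=(d, d))         # frozen pre-trained weight
A = rng.normal(size=(r, d)) * 0.01   # trainable down-projection
B = np.zeros((d, r))                 # trainable up-projection, zero-initialized

def adapted_forward(x):
    return x @ (W0 + B @ A).T        # adapter adds a rank-r correction to W0

x = rng.normal(size=(1, d))
# Because B starts at zero, the adapter is a no-op before any training:
same_as_base = np.allclose(adapted_forward(x), x @ W0.T)

trainable = A.size + B.size          # 2 * d * r = 512 parameters
full = W0.size                       # d * d = 4096 parameters
```

Only `A` and `B` receive gradients, so here the trainable parameter count is 512 versus 4,096 for full fine-tuning, and the ratio improves further as `d` grows relative to the rank `r`.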
Q: Does federated learning fully protect data privacy?
A: Not completely. Attacks that reconstruct original data from gradient updates (gradient inversion attacks) exist. Complete privacy protection requires combining federated learning with additional techniques such as Differential Privacy.