Learning Paradigms
Learning paradigms are the principles and design philosophies that govern how a machine learning model acquires and applies knowledge. This page focuses on the family of techniques that efficiently learn new tasks by reusing existing knowledge.
Why “Reuse” Matters
Training a model from scratch requires large amounts of data and significant compute. However, when relevant knowledge already exists, reusing it can achieve high performance with far less data and cost.
Analogy: Imagine someone who speaks English learning French. The strategies for language learning — how to memorize vocabulary, how to approach grammar, how to practice pronunciation — are already in place. Only the parts specific to French need focused study. This is the essence of learning paradigms.
```mermaid
graph LR
subgraph TL["Transfer Learning"]
BM["Base Model\n(weights frozen)"] --> NL["New Layer\n(trained on new data)"]
end
subgraph FT["Fine-tuning"]
PTM["Pre-trained\nModel (full)"] --> ND["Adjusted with\nnew data"]
end
```

1. Transfer Learning
Transfer learning is a technique that applies knowledge from a model trained on one task to a different but related task.
How It Works
- Prepare a base model: Obtain a model already trained on a large dataset (e.g., an ImageNet classification model)
- Freeze the weights: Lock the parameters (weights) of the base model so they are not updated
- Add and train new layers: Train only the new task-specific final layers on the new dataset (using backpropagation)
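The three steps above can be sketched in a few lines of numpy. Everything here is a hypothetical toy stand-in: a frozen random projection plays the role of the pre-trained feature-extraction layers, and a small logistic layer is the new task head.

```python
import numpy as np

rng = np.random.default_rng(0)

X = rng.normal(size=(200, 16))          # small new-task dataset (toy)
y = (X[:, 0] > 0).astype(float)         # toy binary labels

W_base = rng.normal(size=(16, 8))       # step 1: "pre-trained" base weights
W_frozen = W_base.copy()                # step 2: frozen, never updated

features = np.maximum(X @ W_base, 0.0)  # frozen feature extraction (ReLU)

w_new = np.zeros(8)                     # step 3: new task-specific layer
b_new = 0.0

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss(p):
    return -np.mean(y * np.log(p + 1e-9) + (1 - y) * np.log(1 - p + 1e-9))

initial_loss = loss(sigmoid(features @ w_new + b_new))
for _ in range(500):                    # backprop updates the new layer only
    p = sigmoid(features @ w_new + b_new)
    grad = p - y                        # cross-entropy gradient wrt logits
    w_new -= 0.01 * features.T @ grad / len(y)
    b_new -= 0.01 * grad.mean()
final_loss = loss(sigmoid(features @ w_new + b_new))
```

After training, `final_loss` is lower than `initial_loss` while `W_base` is bit-for-bit unchanged, which is exactly the division of labor transfer learning relies on.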
```mermaid
graph TD
D1["Large training dataset\n(e.g., ImageNet: 1M+ images)"] --> BM["Base Model\n(pre-trained)"]
BM --> |weights frozen| FM1["Feature extraction layers\n(unchanged)"]
FM1 --> NL["New task layer\n(trained on small dataset)"]
D2["New task data\n(e.g., medical images: 1,000)"] --> NL
```

When to Use It
- When data for the new task is scarce (hundreds to a few thousand samples)
- When the original task and the new task are related (both image recognition, both text processing, etc.)
Concrete Example
Use an image classification model trained on ImageNet (1M+ general images) to build a diagnostic model for medical X-ray images (1,000 images). General visual features such as edge detection and shape recognition are reused; only the judgment criteria specific to medical images are learned from scratch.
2. Fine-tuning
Fine-tuning is a technique that re-adjusts the parameters of all layers (or a selected subset) of a pre-trained model using new data.
Difference From Transfer Learning
| | Transfer Learning | Fine-tuning |
|---|---|---|
| Weights updated | New layers only | All layers (or lower layers included) |
| Data requirement | Can work with small amounts | More data is preferable |
| Compute cost | Low | Higher |
| Overfitting risk | Low | High if data is scarce |
How It Works
The entire pre-trained model is trained further on new data. A common effective strategy is discriminative learning rates: lower learning rates for lower layers (which learned general features) and higher learning rates for upper layers (which learned task-specific features).
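Discriminative learning rates amount to nothing more than keeping a per-layer rate and applying it in the update step. A minimal numpy sketch (the layer names, sizes, and rate values are all hypothetical):

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical two-layer model: "lower" holds general features (small
# learning rate), "upper" holds task-specific features (larger rate).
params = {
    "lower": rng.normal(size=(4, 4)),
    "upper": rng.normal(size=(4, 2)),
}
learning_rates = {"lower": 1e-4, "upper": 1e-2}   # discriminative rates

def sgd_step(params, grads, learning_rates):
    """One fine-tuning step with a per-layer learning rate."""
    for name, grad in grads.items():
        params[name] = params[name] - learning_rates[name] * grad

before = {name: p.copy() for name, p in params.items()}
grads = {name: np.ones_like(p) for name, p in params.items()}  # dummy gradients
sgd_step(params, grads, learning_rates)

shift_lower = np.abs(params["lower"] - before["lower"]).max()  # moves by 1e-4
shift_upper = np.abs(params["upper"] - before["upper"]).max()  # moves by 1e-2
```

With unit gradients, the upper layer moves 100x farther per step than the lower layer, so task-specific features adapt quickly while general features are barely disturbed. Deep learning frameworks expose the same idea through per-layer parameter groups in their optimizers.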
Concrete Example
Fine-tune a general-purpose LLM such as GPT-4 on a company’s customer support conversation data to build a dialogue model specialized for that company’s products and services.
3. Multitask Learning
Multitask learning is a model design approach in which a single model learns multiple different tasks simultaneously.
How It Works
```mermaid
graph TD
IN["Input Data"] --> SH["Shared Network"]
SH --> T1["Task 1 Branch\n(e.g., sentiment analysis)"]
SH --> T2["Task 2 Branch\n(e.g., topic classification)"]
SH --> T3["Task 3 Branch\n(e.g., language detection)"]
T1 --> O1["Output 1"]
T2 --> O2["Output 2"]
T3 --> O3["Output 3"]
```

- Shared network: Learns general representations common across all tasks
- Task-specific branches: Generate the output for each individual task
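The shared-trunk-plus-branches structure can be made concrete with a forward pass in numpy. The task names, dimensions, and weights below are hypothetical placeholders:

```python
import numpy as np

rng = np.random.default_rng(2)

W_shared = rng.normal(size=(16, 8))            # shared network
heads = {
    "sentiment": rng.normal(size=(8, 2)),      # task 1 branch: 2 classes
    "topic": rng.normal(size=(8, 5)),          # task 2 branch: 5 classes
    "language": rng.normal(size=(8, 3)),       # task 3 branch: 3 classes
}

def forward(x):
    h = np.maximum(x @ W_shared, 0.0)          # shared representation
    return {task: h @ W for task, W in heads.items()}

outputs = forward(rng.normal(size=(4, 16)))    # batch of 4 inputs
```

Training would minimize a combined objective (e.g., the sum of per-task losses), so gradients from every task flow back into `W_shared`; that shared gradient signal is what lets data-poor tasks benefit from data-rich ones.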
Advantages
- Knowledge learned from each task complements the others, improving generalization beyond what individual training achieves
- Reduces the cost of training and managing separate models for each task
- Tasks with limited data can benefit from the learning signal provided by other tasks
Concrete Example
In natural language processing, training sentiment analysis, topic classification, and language detection together in a single model allows the general language understanding learned for each task to be shared across all tasks.
4. Federated Learning
Federated learning is a technique for training a model on data distributed across many devices without ever centralizing that data on a single server.
How It Works
```mermaid
graph TD
C["Central Server\n(distributes and aggregates model weights)"]
D1["Device 1\n(smartphone)\nLocal training"]
D2["Device 2\n(smartphone)\nLocal training"]
D3["Device 3\n(smartphone)\nLocal training"]
C --> |distribute shared model| D1
C --> |distribute shared model| D2
C --> |distribute shared model| D3
D1 --> |send weight updates only| C
D2 --> |send weight updates only| C
D3 --> |send weight updates only| C
```

1. The central server distributes the shared model to each device
2. Each device trains locally on its own data (data never leaves the device)
3. Only the weight updates (gradients) from each device are sent to the server
4. The server aggregates the updates and improves the shared model
5. Steps 1–4 repeat
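The distribute-train-aggregate loop can be sketched in FedAvg style with numpy. The devices, their data, and the linear model are all hypothetical stand-ins; real systems add sampling, weighting by dataset size, and secure aggregation:

```python
import numpy as np

rng = np.random.default_rng(3)

def local_update(w_global, X, y, lr=0.1, steps=10):
    """Train locally on one device's data; only the delta leaves the device."""
    w = w_global.copy()
    for _ in range(steps):
        grad = X.T @ (X @ w - y) / len(y)      # squared-error gradient
        w -= lr * grad
    return w - w_global                        # weight update, not raw data

w_global = np.zeros(4)                         # shared model on the server
devices = [(rng.normal(size=(20, 4)), rng.normal(size=20)) for _ in range(3)]

for _ in range(5):                             # rounds: distribute, train, aggregate
    updates = [local_update(w_global, X, y) for X, y in devices]
    w_global = w_global + np.mean(updates, axis=0)   # server averages the deltas
```

Note what crosses the network: `w_global` goes out, each device's `return w - w_global` comes back. The arrays `X` and `y` never leave `local_update`.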
Advantages
- Privacy protection: Personal data does not leave the device, reducing privacy risk
- Reduced communication cost: Weight updates rather than raw data are transmitted, keeping bandwidth low
- Personalization: Each device can be optimized based on its own local data
Concrete Example
The keyboard prediction feature on a smartphone. Each user’s input history (personal data) stays on the device. Only the learning results (weight updates) are aggregated to improve the shared model.
Comparison of the Four Techniques
| | Transfer Learning | Fine-tuning | Multitask Learning | Federated Learning |
|---|---|---|---|---|
| Primary goal | Adapting to a new task with limited data | Domain adaptation and performance gains | Simultaneous multi-task learning | Privacy-preserving distributed learning |
| Data requirement | Small amounts acceptable | More data preferable | Data for each task needed | Distributed data |
| Compute cost | Low | Medium–High | Medium | Low per device (distributed) |
| Typical use case | Repurposing an image classifier | LLM domain adaptation | Multi-purpose NLP models | Smartphones, IoT |
Q: Are transfer learning and fine-tuning separate techniques?
A: They are related but distinct. Transfer learning is the broader concept of applying knowledge from a trained model to a new task. Fine-tuning is one way to achieve this — by re-adjusting all model parameters on new data. Some texts distinguish fine-tuning (updating all layers) from transfer learning (updating only the final layer), while others treat fine-tuning as a subset of transfer learning.
Q: How much data does LLM fine-tuning require?
A: It depends on the task and quality requirements, but a few hundred to a few thousand high-quality examples can be enough for task-specific adaptation. Techniques like LoRA (Low-Rank Adaptation) enable effective fine-tuning by updating only a small fraction of parameters.
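The core idea of LoRA fits in a few lines: keep the pre-trained weight frozen and learn a low-rank correction. A minimal numpy sketch (the sizes and rank are hypothetical; real LoRA is applied inside transformer attention/projection layers):

```python
import numpy as np

rng = np.random.default_rng(4)

d, r = 64, 4                         # hidden size 64, adapter rank 4
W0 = rng.normal(size=(d, d))         # frozen pre-trained weight
A = rng.normal(size=(r, d)) * 0.01   # trainable down-projection
B = np.zeros((d, r))                 # trainable up-projection, zero-initialized

def adapted_forward(x):
    return x @ (W0 + B @ A).T        # adapter adds a rank-r correction to W0

x = rng.normal(size=(1, d))
# Because B starts at zero, the adapter is a no-op before any training:
same_as_base = np.allclose(adapted_forward(x), x @ W0.T)

trainable = A.size + B.size          # 2 * d * r = 512 parameters
full = W0.size                       # d * d = 4096 parameters
```

Only `A` and `B` receive gradients, so here the trainable parameter count is 512 versus 4,096 for full fine-tuning, and the ratio improves further as `d` grows relative to the rank `r`.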
Q: Does federated learning fully protect data privacy?
A: Not completely. Attacks that reconstruct original data from gradient updates (gradient inversion attacks) exist. Complete privacy protection requires combining federated learning with additional techniques such as Differential Privacy.