Fine-tuning Methods Compared
About 10 minutes
Fine-tuning is the process of taking a pre-trained LLM and continuing to train it on a specific task or domain. Rather than starting from scratch, fine-tuning uses an already language-capable model as the starting point, making it possible to reach target behavior with less data and compute.
Why Fine-tuning Is Needed
General-purpose LLMs carry broad knowledge, but additional training is valuable in the following situations:
- Consistently producing output in a specific format (e.g. JSON, medical report templates)
- Accurately handling domain-specific vocabulary and knowledge (e.g. legal documents, manufacturing manuals)
- Establishing a unified response style and tone
- Adjusting fine-grained behavior that prompt engineering alone cannot address
Fine-tuning does come with cost and complexity. Understanding which method fits the use case is the first step.
Full Fine-tuning
Full fine-tuning updates every weight parameter in the model using training data.
How It Works
Every parameter in the model (billions, for current LLMs) is updated through backpropagation. Because nothing is frozen, this method can fit the training data most closely and, in theory, reaches the best possible result quality.
Challenges
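- GPU memory: Full fine-tuning a 7B-parameter model requires tens of gigabytes of GPU memory at the very least. With the Adam optimizer in mixed precision, weights, gradients, and optimizer states alone take roughly 16 bytes per parameter, which is over 100 GB before activations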
- Catastrophic forgetting: Training on new data can cause the model to lose previously acquired general capabilities
- Cost: For large models, compute costs can reach thousands to tens of thousands of dollars
Full fine-tuning is best suited for production systems where quality is the top priority and sufficient compute resources and high-quality data are available.
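As a rough back-of-the-envelope check on the GPU memory challenge above, the sketch below estimates the memory held by weights, gradients, and optimizer state. The byte counts are assumptions rather than figures from this article: mixed-precision training with Adam, using 2 bytes per parameter for 16-bit weights, 2 for gradients, and 12 for the 32-bit master copy plus the two Adam moments. Activations and framework overhead are ignored, so real usage is higher.

```python
def full_ft_state_gb(n_params, bytes_weights=2, bytes_grads=2, bytes_optimizer=12):
    """Approximate GB for weights + gradients + optimizer state (no activations)."""
    total_bytes = n_params * (bytes_weights + bytes_grads + bytes_optimizer)
    return total_bytes / 1e9


# Assumed setup: 7B parameters, Adam, mixed precision
estimate = full_ft_state_gb(7e9)
print(f"7B model: ~{estimate:.0f} GB before activations")
```

Under these assumptions the training state alone comes to about 112 GB, which is why full fine-tuning at this scale is usually spread across several GPUs.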
Parameter-Efficient Fine-Tuning (PEFT)
PEFT (Parameter-Efficient Fine-Tuning) refers to a family of methods that freeze most of the model’s parameters and train only a small set of additional parameters.
PEFT emerged because full fine-tuning became impractical as models grew larger. Research has shown that PEFT methods can achieve quality comparable to full fine-tuning at a fraction of the compute cost.
The main PEFT methods are LoRA, QLoRA, and Adapter layers.
LoRA (Low-Rank Adaptation)
LoRA (Low-Rank Adaptation) freezes the base model weights, adds a pair of small adapter matrices alongside selected weight matrices (typically in the attention layers), and trains only those added matrices.
Analogy
Instead of repainting the entire car (full fine-tuning), LoRA is like adding stickers or decals. The car itself (base model) stays the same; only the add-ons change its appearance or function.
How It Works
graph LR
I["Input"] --> B["Base model (frozen)"]
I --> LA["LoRA adapter matrix A (low-rank)"]
LA --> LB["LoRA adapter matrix B (low-rank)"]
B --> ADD["Sum"]
LB --> ADD
ADD --> O["Output"]

LoRA approximates the weight update $\Delta W$ as the product of two low-rank matrices, $\Delta W = A \times B$. If the original weight matrix is $d \times d$, then $A$ is $d \times r$ and $B$ is $r \times d$ (where $r \ll d$), so the pair holds $2dr$ values instead of $d^2$. Only these two small matrices are trained.[1]
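The data flow in the diagram can be sketched in plain Python. This is a toy illustration under stated assumptions, not a library implementation: it uses the row-vector convention implied by the text (the input passes through $A$, then $B$; the LoRA paper writes the same product as $BA$ with column vectors), initializes the second matrix to zero as the paper does, and the names `vecmat`, `lora_forward`, and `scale` are hypothetical.

```python
import random


def vecmat(v, M):
    """Multiply row vector v by matrix M (a list of rows)."""
    return [sum(v[i] * M[i][j] for i in range(len(v))) for j in range(len(M[0]))]


def lora_forward(x, W, A, B, scale=1.0):
    """Frozen path x @ W plus low-rank path (x @ A) @ B."""
    base = vecmat(x, W)
    update = vecmat(vecmat(x, A), B)
    return [b + scale * u for b, u in zip(base, update)]


rng = random.Random(0)
d, r = 8, 2  # toy sizes; real models use d in the thousands

W = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(d)]     # frozen, d x d
A = [[rng.gauss(0, 0.01) for _ in range(r)] for _ in range(d)]  # trainable, d x r
B = [[0.0] * d for _ in range(r)]                               # trainable, r x d, zero init

x = [rng.gauss(0, 1) for _ in range(d)]
y = lora_forward(x, W, A, B)

full_params = d * d
lora_params = d * r + r * d
print(f"trainable: {lora_params} of {full_params} weights in this layer")
```

Because `B` starts at zero, the adapted layer initially reproduces the frozen layer exactly, and training moves it away from the base model only as far as the data requires.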
Characteristics
- Trainable parameters are roughly 0.1–1% of the total
- Multiple LoRA adapters can be swapped without changing the base model
- Adapters can be merged into the base model weights before inference, so the merged model adds no inference overhead
- Available through the Hugging Face PEFT library[3]
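The merging property can be checked on a tiny example. The sketch below is an illustration with made-up numbers and hypothetical helper names (`vecmat`, `matmul`, `merge_lora`); it uses the same row-vector convention as the text, so the merged weight is $W + A \times B$.

```python
def vecmat(v, M):
    """Multiply row vector v by matrix M (a list of rows)."""
    return [sum(v[i] * M[i][j] for i in range(len(v))) for j in range(len(M[0]))]


def matmul(X, Y):
    """Plain matrix product of two lists of rows."""
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]


def merge_lora(W, A, B, scale=1.0):
    """Fold the adapter into the base weights: W + scale * (A @ B)."""
    delta = matmul(A, B)
    return [[w + scale * dw for w, dw in zip(w_row, d_row)]
            for w_row, d_row in zip(W, delta)]


W = [[1.0, 2.0], [3.0, 4.0]]  # frozen base weights, d x d with d = 2
A = [[0.5], [-1.0]]           # trained adapter factor, d x r with r = 1
B = [[2.0, 1.0]]              # trained adapter factor, r x d
x = [1.0, 2.0]

# Two-path computation: base output plus adapter output
unmerged = [b + u for b, u in zip(vecmat(x, W), vecmat(vecmat(x, A), B))]

# Single-path computation after merging
merged = vecmat(x, merge_lora(W, A, B))
print(unmerged, merged)
```

Both paths give the same output, which is why a merged model costs nothing extra at inference. Keeping the adapter unmerged is what allows adapters to be swapped on one shared base model.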
QLoRA (Quantized LoRA)
QLoRA combines LoRA with quantization, compressing the base model weights to 4-bit precision (NF4 format). It was proposed as a way to substantially reduce the memory required to fine-tune large models.[2]
How It Works
Normally, model weights are stored in float16 (16-bit) or float32 (32-bit). QLoRA stores the frozen base model weights in 4-bit NF4 format, dequantizes them to 16-bit on the fly for computation, and keeps the LoRA adapter matrices in 16-bit precision (BFloat16 in the original paper) for training.[2]
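The effect of the storage format on weight memory is simple arithmetic, sketched below for a 7B-parameter model. The function name is hypothetical, and the estimate deliberately ignores the small per-block quantization constants, the adapter weights, activations, and optimizer state, so it is a lower bound rather than a measured footprint.

```python
def weight_storage_gb(n_params, bits_per_weight):
    """GB needed to store the weights alone at a given precision."""
    return n_params * bits_per_weight / 8 / 1e9


n_params = 7e9
sizes = {
    "float32": weight_storage_gb(n_params, 32),
    "float16": weight_storage_gb(n_params, 16),
    "NF4 (4-bit)": weight_storage_gb(n_params, 4),
}
for name, gb in sizes.items():
    print(f"{name:>12}: {gb:.1f} GB")
```

Going from 16-bit to 4-bit storage cuts the base model from about 14 GB to about 3.5 GB, which leaves room on a single GPU for the adapters and activations.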
Characteristics
- A 7B-parameter-class model may be feasible on a 24 GB-class GPU
- Achieves quality comparable to LoRA while drastically reducing memory usage
- Minor accuracy loss from quantization is within acceptable range for most use cases
Typical Use Cases
- Individual developers or startups customizing models with available hardware
- Running many experiments while keeping cloud compute costs down
Adapter Layers
Adapter layers insert small fully-connected networks between Transformer blocks, training only those networks while the base model parameters remain frozen.
How It Works
After each Transformer block (or, in the original adapter design, after the sublayers inside each block), a small bottleneck module is added: a down-projection, a non-linear activation, and an up-projection. Only the added Adapter modules are trained; the base model is left intact.
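The three steps of an Adapter module can be sketched as follows. This is a minimal illustration with assumptions the text does not state: ReLU as the non-linearity, no bias terms, and a residual connection that adds the module's output back onto its input (common in bottleneck adapters). All names and numbers are hypothetical.

```python
def matvec(M, v):
    """Multiply matrix M (a list of rows) by column vector v."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def relu(v):
    return [max(0.0, x) for x in v]


def adapter_forward(h, W_down, W_up):
    """Down-project, apply ReLU, up-project, then add the input back."""
    z = relu(matvec(W_down, h))  # d -> m, with m much smaller than d
    out = matvec(W_up, z)        # m -> d
    return [hi + oi for hi, oi in zip(h, out)]  # residual connection (assumed)


h = [1.0, -2.0, 0.5, 3.0]                 # hidden state from the frozen block, d = 4
W_down = [[0.1, 0.0, 0.2, 0.0],
          [0.0, 0.3, 0.0, -0.1]]          # trainable, m x d with m = 2
W_up = [[0.0, 0.0] for _ in range(4)]     # trainable, d x m, zero init

print(adapter_forward(h, W_down, W_up))
```

With the up-projection initialized to zero the module starts as an identity function, so inserting it does not disturb the frozen model until training begins. Unlike LoRA, the non-linearity means the module cannot be folded into the base weights afterwards.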
Characteristics
- Adapter modules run on every forward pass, adding slight inference latency
- An older method than LoRA, but still used in multi-task learning
- A single base model can serve multiple tasks by swapping task-specific Adapters
Comparison Table
| Method | Trainable param % | GPU memory | Quality vs full FT | Typical use case |
|---|---|---|---|---|
| Full fine-tuning | 100% | Very high (tens of GB+) | Baseline (best) | Production quality, sufficient resources |
| LoRA | 0.1–1% | Medium (frozen base plus adapter training state) | 95–99% | Quick experiments, managing multiple adapters |
| QLoRA | 0.1–1% (4-bit base) | Low (consumer GPU) | 90–97% | GPU-constrained environments, personal use |
| Adapter | 1–5% | Low–medium | 90–97% | Multi-task learning, task switching |
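To see where a trainable-parameter percentage like the LoRA row's comes from, the sketch below counts adapter weights for a made-up configuration. Every number here is an illustrative assumption, not a value from this article: 32 layers, hidden size 4096, rank 16, four adapted square matrices per layer, and 7 billion total parameters.

```python
def lora_param_count(n_layers, hidden_size, rank, matrices_per_layer):
    """Adapter weights across the model: two hidden_size x rank factors per adapted matrix."""
    per_matrix = 2 * hidden_size * rank
    return n_layers * matrices_per_layer * per_matrix


# Hypothetical 7B-class shape (illustrative only)
total_params = 7_000_000_000
lora_params = lora_param_count(n_layers=32, hidden_size=4096, rank=16, matrices_per_layer=4)
fraction = lora_params / total_params
print(f"{lora_params:,} trainable weights = {fraction:.2%} of the model")
```

Under these assumptions the adapters hold about 16.8 million weights, roughly 0.24% of the model. Raising the rank or adapting more matrices moves the figure up within the range shown in the table.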
How to Choose
graph TD
A["Start fine-tuning"] --> B{"GPU resources?"}
B -->|"Consumer GPU only"| C["QLoRA"]
B -->|"Cloud GPU / mid-range"| D{"Quality vs cost?"}
D -->|"Fast experiments"| E["LoRA"]
D -->|"Production quality required"| F{"Data and budget?"}
F -->|"Sufficient data and budget"| G["Full fine-tuning"]
F -->|"Cost constraints"| E

Decision guidelines:
- Quick experiments → LoRA (easy setup, strong quality)
- Limited GPU → QLoRA (runs on consumer hardware)
- Production quality is critical → Full fine-tuning (when data and resources allow)
- One base model for multiple tasks → Adapter (swap per task)
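The guidelines above can be written down as a small helper. This is only a sketch of the decision flow: the function and argument names are hypothetical, and checking the multi-task case first is an assumption, since the text does not rank that guideline against the others.

```python
def choose_method(gpu, production_quality=False, ample_data_and_budget=False, multi_task=False):
    """Return a suggested fine-tuning method.

    gpu: "consumer" or "cloud" (hypothetical labels for the two flowchart branches).
    """
    if multi_task:
        return "Adapter"            # one base model, swap per task
    if gpu == "consumer":
        return "QLoRA"              # limited GPU memory
    if production_quality and ample_data_and_budget:
        return "Full fine-tuning"   # quality first, resources available
    return "LoRA"                   # fast experiments or cost constraints


print(choose_method("consumer"))
print(choose_method("cloud", production_quality=True, ample_data_and_budget=True))
print(choose_method("cloud", production_quality=True))
```

Encoding the choice this way makes the fallback explicit: on cloud hardware, anything short of "production quality with sufficient data and budget" lands on LoRA.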
Key Tools
- Hugging Face PEFT: Unified API for LoRA and other adapter-style methods; also used for QLoRA together with a 4-bit quantized base model
- trl (Transformer Reinforcement Learning): Supports SFT (Supervised Fine-Tuning) and RLHF[4]
- Unsloth: LoRA/QLoRA training library that reports 2–5× faster training with improved memory efficiency
- LlamaFactory: GUI-based tool for configuring and running LoRA training
Q: Is a GPU required for fine-tuning?
A: Full fine-tuning and standard LoRA require a GPU. QLoRA’s 4-bit quantization makes 7B-parameter-class models feasible on 24 GB-class GPUs in some setups.[2] Training on CPU alone is theoretically possible but takes impractically long.
Q: How many training examples do I need?
A: As a rough guide, as few as 100–500 high-quality examples can produce noticeable results for behavior tweaks, while domain adaptation typically requires 1,000–10,000 examples. Quality matters more than quantity: 1,000 well-curated examples generally outperform 10,000 noisy ones.
Q: What happens if I overtrain?
A: The model begins to “memorize” the training data and loses the ability to generalize to new inputs. In practice, it may only handle phrasing that appears in training data, or quality drops sharply for patterns not seen during training. Monitor validation loss and stop training when it starts rising.
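The advice to stop when validation loss starts rising is usually implemented as early stopping with a patience window. The sketch below is one simple variant under assumed conventions: the function name, the `patience` parameter, and the loss values are all hypothetical, and real training loops typically also save a checkpoint at the best epoch.

```python
def early_stop_epoch(val_losses, patience=2):
    """Return the best epoch's index once validation loss has failed to improve
    `patience` times in a row; return None if that never happens."""
    best_loss = float("inf")
    best_epoch = None
    bad_epochs = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch, bad_epochs = loss, epoch, 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:
                return best_epoch
    return None


# Made-up validation curve: improves, then starts rising (overfitting)
history = [1.20, 0.95, 0.80, 0.78, 0.81, 0.86, 0.93]
best = early_stop_epoch(history, patience=2)
print(f"stop and keep the checkpoint from epoch {best}")
```

A patience of more than one epoch avoids stopping on a single noisy evaluation, at the cost of a little extra training after the true minimum.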
References
- [1] LoRA: Low-Rank Adaptation of Large Language Models. Original LoRA paper (Hu et al., 2021)
- [2] QLoRA: Efficient Finetuning of Quantized LLMs. Original QLoRA paper (Dettmers et al., 2023)
- [3] Hugging Face PEFT Documentation. Official PEFT library documentation
- [4] Hugging Face trl Documentation. SFT and RLHF training library