Fine-tuning Methods Compared

About 10 minutes

Basics covered in Fine-tuning Introduction

Fine-tuning is the process of taking a pre-trained LLM and continuing to train it on a specific task or domain. Rather than starting from scratch, fine-tuning uses an already language-capable model as the starting point, making it possible to reach target behavior with less data and compute.

Why Fine-tuning Is Needed

General-purpose LLMs carry broad knowledge, but additional training is valuable in the following situations:

Consistently producing output in a specific format (e.g. JSON, medical report templates)
Accurately handling domain-specific vocabulary and knowledge (e.g. legal documents, manufacturing manuals)
Establishing a unified response style and tone
Adjusting fine-grained behavior that prompt engineering alone cannot address

Fine-tuning does come with cost and complexity. Understanding which method fits the use case is the first step.

Full Fine-tuning

Full fine-tuning updates every weight parameter in the model using training data.

How It Works

Every parameter, often billions of them, receives a gradient through backpropagation and is updated by the optimizer. Because nothing is frozen, this method has the most capacity to fit the training data and, in theory, gives the best possible result quality.

Challenges

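GPU memory: Full fine-tuning a 7B-parameter model typically requires tens of gigabytes to well over 100 GB of GPU memory, depending on precision and optimizer, because gradients and optimizer states must be stored for every weight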
Catastrophic forgetting: Training on new data can cause the model to lose previously acquired general capabilities
Cost: For large models, compute costs can reach thousands to tens of thousands of dollars
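The memory challenge above can be made concrete with a back-of-the-envelope estimate. The sketch below assumes mixed-precision training with Adam (16-bit weights and gradients, plus a 32-bit master copy of the weights and two 32-bit Adam moments); these byte counts and the function name are illustrative assumptions, and activations and framework overhead are left out.

```python
def full_finetune_memory_gb(n_params: float,
                            weight_bytes: int = 2,
                            grad_bytes: int = 2,
                            optimizer_bytes: int = 12) -> float:
    """Rough lower bound on memory for weights, gradients, and optimizer state.

    Assumed defaults (mixed-precision Adam): 16-bit weights and gradients,
    plus a 32-bit master copy and two 32-bit Adam moments (4 + 4 + 4 bytes).
    Activations and framework overhead are NOT included.
    """
    bytes_per_param = weight_bytes + grad_bytes + optimizer_bytes
    return n_params * bytes_per_param / 1e9


estimate = full_finetune_memory_gb(7e9)
print(f"7B model, mixed-precision Adam: ~{estimate:.0f} GB before activations")
```

Lower-precision optimizers or sharding across several GPUs change the per-device figure, so treat the result as an order-of-magnitude guide only.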

Full fine-tuning is best suited for production systems where quality is the top priority and sufficient compute resources and high-quality data are available.

Parameter-Efficient Fine-Tuning (PEFT)

PEFT (Parameter-Efficient Fine-Tuning) refers to a family of methods that freeze most of the model’s parameters and train only a small set of additional parameters.

PEFT emerged because full fine-tuning became impractical as models grew larger. The LoRA and QLoRA papers report quality comparable to full fine-tuning on their benchmarks at a fraction of the memory and compute cost.[1][2]

The main PEFT methods are LoRA, QLoRA, and Adapter layers.

LoRA (Low-Rank Adaptation)

LoRA (Low-Rank Adaptation) freezes the base model weights and adds a pair of small low-rank matrices alongside selected weight matrices, typically the attention projections, training only those added matrices.

Analogy

Instead of repainting the entire car (full fine-tuning), LoRA is like adding stickers or decals. The car itself (base model) stays the same; only the add-ons change its appearance or function.

How It Works

graph LR
    I["Input"] --> B["Base model (frozen)"]
    I --> LA["LoRA adapter matrix A (low-rank)"]
    LA --> LB["LoRA adapter matrix B (low-rank)"]
    B --> ADD["Sum"]
    LB --> ADD
    ADD --> O["Output"]

LoRA approximates the weight update $\Delta W$ as the product of two low-rank matrices, $\Delta W = BA$. If the original weight matrix $W$ is $d \times d$, then $B$ is $d \times r$ and $A$ is $r \times d$, where $r \ll d$. The layer output becomes $h = Wx + BAx$: the input passes through $A$ first and then $B$, as in the diagram above. Only these two small matrices are trained.[1]
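A minimal pure-Python sketch of this forward pass follows. It uses the LoRA paper's convention, in which the input passes through $A$ ($r \times d$) and then $B$ ($d \times r$), and the paper's initialization with $B$ set to zero so that training starts from the unmodified base model. The helper names, the tiny sizes, and the `scale` argument are illustrative assumptions.

```python
import random


def matvec(M, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m_ij * x_j for m_ij, x_j in zip(row, x)) for row in M]


def lora_forward(W, A, B, x, scale=1.0):
    """Compute h = W x + scale * B (A x). W stays frozen; A and B are trainable."""
    base = matvec(W, x)                 # frozen base path
    update = matvec(B, matvec(A, x))    # low-rank path: d -> r -> d
    return [b + scale * u for b, u in zip(base, update)]


random.seed(0)
d, r = 6, 2  # toy sizes; real models use d in the thousands
W = [[random.gauss(0, 1) for _ in range(d)] for _ in range(d)]     # d x d, frozen
A = [[random.gauss(0, 0.01) for _ in range(d)] for _ in range(r)]  # r x d, random init
B = [[0.0] * r for _ in range(d)]                                  # d x r, zero init
x = [random.gauss(0, 1) for _ in range(d)]

h = lora_forward(W, A, B, x)
# With B = 0 the low-rank path contributes nothing yet.
print(all(abs(a - b) < 1e-12 for a, b in zip(h, matvec(W, x))))
```

Training would then update only the entries of `A` and `B`, leaving `W` untouched.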

Characteristics

Trainable parameters are roughly 0.1–1% of the total
Multiple LoRA adapters can be swapped without changing the base model
Adapter weights can be merged into the base weights before deployment, so inference incurs no extra latency
Available through the Hugging Face PEFT library[3]
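Two of these characteristics can be checked with a few lines of arithmetic. The sketch below computes the trainable share for a single adapted weight matrix and shows that folding the low-rank product into the base weights gives the same output as running the two paths separately. The sizes `d = 4096` and `r = 8` and all function names are illustrative assumptions; the share for a whole model is lower than the per-matrix figure because not every matrix gets an adapter.

```python
def lora_trainable_fraction(d: int, r: int) -> float:
    """Trainable share for ONE d x d weight matrix with a rank-r LoRA pair."""
    lora_params = 2 * d * r  # B is d x r, A is r x d
    return lora_params / (d * d)


def matvec(M, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, x)) for row in M]


def merge_lora(W, A, B, scale=1.0):
    """Return W + scale * (B @ A) as a new list-of-lists matrix."""
    rank = len(A)
    return [
        [W[i][j] + scale * sum(B[i][k] * A[k][j] for k in range(rank))
         for j in range(len(W[0]))]
        for i in range(len(W))
    ]


fraction = lora_trainable_fraction(d=4096, r=8)
print(f"Trainable share per adapted matrix: {fraction:.2%}")

W = [[1.0, 2.0], [3.0, 4.0]]  # frozen base weight, d = 2
A = [[1.0, 0.0]]              # r x d with r = 1
B = [[2.0], [3.0]]            # d x r
x = [1.0, 1.0]

merged = merge_lora(W, A, B)
unmerged = [w + u for w, u in zip(matvec(W, x), matvec(B, matvec(A, x)))]
print(matvec(merged, x) == unmerged)
```

Because the merged matrix has the same shape as the original, the deployed model runs exactly as many operations as the base model.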

QLoRA (Quantized LoRA)

QLoRA combines LoRA with quantization, compressing the base model weights to 4-bit precision (NF4 format). It was proposed as a way to substantially reduce the memory required to fine-tune large models.[2]

How It Works

Normally, model weights are stored in float16 (16-bit) or float32 (32-bit). QLoRA stores the frozen base model weights in 4-bit NF4 format, dequantizes them to 16-bit on the fly for computation, and keeps the LoRA adapter matrices in 16-bit precision (bfloat16 in the paper) for training.[2]
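To illustrate the store-in-4-bit, compute-in-16-bit idea, here is a simplified sketch of block-wise quantization. It is not NF4: NF4 places its 16 levels according to a normal distribution, whereas this sketch uses 16 evenly spaced levels scaled by the block's largest absolute value. The function names and sample weights are hypothetical.

```python
def quantize_block(weights, levels=16):
    """Map a block of floats to integer codes in [0, levels - 1] (uniform absmax)."""
    absmax = max(abs(w) for w in weights) or 1.0  # avoid dividing by zero
    half = (levels - 1) / 2                       # 7.5 for 16 levels
    codes = [round((w / absmax + 1) * half) for w in weights]
    return codes, absmax


def dequantize_block(codes, absmax, levels=16):
    """Recover approximate floats from the codes and the stored scale."""
    half = (levels - 1) / 2
    return [(c / half - 1) * absmax for c in codes]


weights = [0.12, -0.50, 0.33, 0.02, -0.07, 0.41, -0.29, 0.50]
codes, absmax = quantize_block(weights)
restored = dequantize_block(codes, absmax)
max_error = max(abs(w - r) for w, r in zip(weights, restored))

print(codes)
print(f"largest round-trip error: {max_error:.4f}")
```

Each code fits in 4 bits, and only one scale value per block is stored at higher precision. The round-trip error is the quantization loss that the adapter training has to tolerate.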

Characteristics

A 7B-parameter-class model may be feasible on a 24 GB-class GPU
Achieves quality comparable to LoRA while drastically reducing memory usage
Minor accuracy loss from quantization is within acceptable range for most use cases
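The memory reduction follows directly from the bit width. The sketch below estimates storage for the base weights alone at different precisions. The function name is an assumption, and the figures ignore quantization constants, adapter weights, optimizer state, and activations.

```python
def weight_memory_gb(n_params: float, bits_per_weight: int) -> float:
    """Storage for the weights only, in gigabytes (10**9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


for label, bits in [("float32", 32), ("float16", 16), ("NF4", 4)]:
    print(f"{label:>8}: {weight_memory_gb(7e9, bits):.1f} GB")
```

The remaining headroom on a 24 GB-class GPU goes to activations, the adapter, and its optimizer state, which is why feasibility still depends on sequence length and batch size.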

Typical Use Cases

Individual developers or startups customizing models with available hardware
Running many experiments while keeping cloud compute costs down

Adapter Layers

Adapter layers insert small fully-connected bottleneck networks into the Transformer layers, training only those networks while the base model parameters remain frozen.

How It Works

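In the original adapter design, a small module is added after the attention and feed-forward sub-layers of each Transformer layer: a down-projection, a non-linear activation, and an up-projection, wrapped in a residual connection. Only the added Adapter modules are trained; the base model is left intact.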
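The module described above can be sketched in a few lines of pure Python. The residual connection around the bottleneck follows the original adapter design and is an assumption relative to the description here; ReLU as the activation, the function names, and the toy matrices are also illustrative choices.

```python
def matvec(M, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, x)) for row in M]


def relu(v):
    """Element-wise non-linear activation."""
    return [max(0.0, a) for a in v]


def adapter_forward(h, W_down, W_up):
    """Bottleneck adapter: h + W_up * relu(W_down * h)."""
    z = relu(matvec(W_down, h))   # down-projection to a small dimension
    delta = matvec(W_up, z)       # up-projection back to the hidden size
    return [a + b for a, b in zip(h, delta)]  # residual connection


h = [1.0, -2.0, 3.0, 0.0]                      # hidden state, size 4
W_down = [[1.0, 0.0, 0.0, 0.0],
          [0.0, 1.0, 0.0, 0.0]]                # 2 x 4 bottleneck
W_up = [[1.0, 0.0],
        [0.0, 1.0],
        [1.0, 1.0],
        [0.0, 0.0]]                            # 4 x 2

print(adapter_forward(h, W_down, W_up))
```

Because of the non-linearity between the two projections, this module cannot be folded into the neighbouring weight matrices the way a LoRA update can, so it runs on every forward pass.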

Characteristics

Unlike a LoRA update, an Adapter module cannot be merged into the base weights, so it runs on every forward pass and adds slight latency
An older method than LoRA, but still used in multi-task learning
A single base model can serve multiple tasks by swapping task-specific Adapters

Comparison Table

Method	Trainable parameters (%)	GPU memory	Quality vs. full fine-tuning (rough)	Typical use case
Full fine-tuning	100%	Very high (weights, gradients, and optimizer states for every parameter)	Baseline (best)	Production quality, sufficient resources
LoRA	0.1–1%	Medium (frozen 16-bit base; gradients and optimizer states for the adapter only)	95–99%	Quick experiments, managing multiple adapters
QLoRA	0.1–1% (4-bit base)	Low (consumer GPU)	90–97%	GPU-constrained environments, personal use
Adapter	1–5%	Low–medium	90–97%	Multi-task learning, task switching

How to Choose

graph TD
    A["Start fine-tuning"] --> B{"GPU resources?"}
    B -->|"Consumer GPU only"| C["QLoRA"]
    B -->|"Cloud GPU / mid-range"| D{"Quality vs cost?"}
    D -->|"Fast experiments"| E["LoRA"]
    D -->|"Production quality required"| F{"Data and budget?"}
    F -->|"Sufficient data and budget"| G["Full fine-tuning"]
    F -->|"Cost constraints"| E

Decision guidelines:

Quick experiments → LoRA (easy setup, strong quality)
Limited GPU → QLoRA (runs on consumer hardware)
Production quality is critical → Full fine-tuning (when data and resources allow)
One base model for multiple tasks → Adapter (swap per task)
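The flowchart can also be written as a small function. This is only a sketch of the branches shown there: the argument names and the `"consumer"` / `"cloud"` labels are hypothetical, and the multi-task Adapter guideline is left out because it is not part of the flowchart.

```python
def choose_method(gpu: str,
                  production_quality: bool = False,
                  ample_data_and_budget: bool = False) -> str:
    """Follow the decision flowchart and return the suggested method.

    gpu: "consumer" for a consumer GPU only, "cloud" for cloud or mid-range GPUs.
    """
    if gpu == "consumer":
        return "QLoRA"
    if gpu != "cloud":
        raise ValueError(f"unknown gpu label: {gpu!r}")
    if not production_quality:
        return "LoRA"              # fast experiments
    if ample_data_and_budget:
        return "Full fine-tuning"
    return "LoRA"                  # production goal, but cost-constrained


print(choose_method("consumer"))
print(choose_method("cloud", production_quality=True, ample_data_and_budget=True))
```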

Key Tools

Hugging Face PEFT: Unified API for LoRA and other parameter-efficient methods; QLoRA-style training combines it with 4-bit model loading via bitsandbytes
trl (Transformer Reinforcement Learning): Supports SFT (Supervised Fine-Tuning) and RLHF[4]
Unsloth: Optimized LoRA/QLoRA training; the project reports 2–5× faster training with improved memory efficiency
LlamaFactory: GUI-based tool for configuring and running LoRA training

FAQ

Q: Is a GPU required for fine-tuning?

A: Full fine-tuning and standard LoRA require a GPU. QLoRA’s 4-bit quantization makes 7B-parameter-class models feasible on 24 GB-class GPUs in some setups.[2] Training on CPU alone is theoretically possible but takes impractically long.

Q: How many training examples do I need?

A: As few as 100–500 high-quality examples can produce noticeable results for behavior tweaks. Domain adaptation typically requires 1,000–10,000 examples. Quality matters more than quantity: 1,000 well-curated examples generally outperform 10,000 noisy ones.

Q: What happens if I overtrain?

A: The model begins to “memorize” the training data and loses the ability to generalize to new inputs. In practice, it may only handle phrasing that appears in training data, or quality drops sharply for patterns not seen during training. Monitor validation loss and stop training when it starts rising.
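The advice to stop when validation loss starts rising is usually implemented as early stopping. The sketch below adds a `patience` setting, an assumption not stated above, so that a single noisy epoch does not end training too soon; the function name and the loss values are made up for illustration.

```python
def early_stopping(val_losses, patience=2):
    """Return (best_epoch, stop_epoch) for a sequence of validation losses.

    Training stops once the loss has failed to improve for `patience`
    consecutive epochs; the checkpoint from best_epoch is the one to keep.
    """
    best_loss = float("inf")
    best_epoch = -1
    epochs_without_improvement = 0
    stop_epoch = len(val_losses) - 1  # default: ran through every epoch
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                stop_epoch = epoch
                break
    return best_epoch, stop_epoch


losses = [1.20, 0.95, 0.81, 0.78, 0.80, 0.85, 0.90]  # illustrative values
best_epoch, stop_epoch = early_stopping(losses)
print(f"keep checkpoint from epoch {best_epoch}, stopped at epoch {stop_epoch}")
```

In practice the same logic is usually supplied by the training framework, so a hand-written loop like this is mainly useful for understanding what the setting does.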

References

[1] LoRA: Low-Rank Adaptation of Large Language Models. LoRA original paper (Hu et al., 2021)
[2] QLoRA: Efficient Finetuning of Quantized LLMs. QLoRA original paper (Dettmers et al., 2023)
[3] Hugging Face PEFT Documentation. Official PEFT library documentation
[4] Hugging Face trl Documentation. SFT and RLHF training library
