Fine-tuning Methods Compared

Reading time: about 10 minutes

Prerequisites: Basics covered in Fine-tuning Introduction

Fine-tuning is the process of taking a pre-trained LLM and continuing to train it on a specific task or domain. Rather than starting from scratch, fine-tuning uses an already language-capable model as the starting point, making it possible to reach target behavior with less data and compute.

General-purpose LLMs carry broad knowledge, but additional training is valuable in the following situations:

  • Consistently producing output in a specific format (e.g. JSON, medical report templates)
  • Accurately handling domain-specific vocabulary and knowledge (e.g. legal documents, manufacturing manuals)
  • Establishing a unified response style and tone
  • Adjusting fine-grained behavior that prompt engineering alone cannot address

Fine-tuning does come with cost and complexity. Understanding which method fits the use case is the first step.

Full fine-tuning updates every weight parameter in the model using training data.

All billions of parameters are updated through backpropagation. This method achieves the highest fidelity to the training data and, in theory, the best possible result quality.

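  • GPU memory: Full fine-tuning a 7B-parameter model with a standard Adam optimizer in mixed precision needs roughly 16 bytes per parameter for weights, gradients, and optimizer state, which puts the total at around 100 GB or more before activations are counted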
  • Catastrophic forgetting: Training on new data can cause the model to lose previously acquired general capabilities
  • Cost: For large models, compute costs can reach thousands to tens of thousands of dollars
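The memory point above can be made concrete with a back-of-the-envelope estimate. The sketch below assumes a common rule of thumb for mixed-precision training with Adam (16-bit weights and gradients, plus 32-bit master weights and two 32-bit optimizer moments); the byte counts and the function name are illustrative assumptions, not measurements, and activation memory is left out.

```python
# Rule-of-thumb bytes per parameter for full fine-tuning with Adam in
# mixed precision. These figures are an assumption for illustration.
BYTES_PER_PARAM = {
    "fp16_weights": 2,
    "fp16_gradients": 2,
    "fp32_master_weights": 4,
    "adam_first_moment_fp32": 4,
    "adam_second_moment_fp32": 4,
}


def full_finetune_memory_gb(num_params: int) -> float:
    """Estimate GB needed for weights, gradients, and optimizer state.

    Activations, temporary buffers, and framework overhead are excluded,
    so real usage is higher than this estimate.
    """
    total_bytes = num_params * sum(BYTES_PER_PARAM.values())
    return total_bytes / 1e9


estimate = full_finetune_memory_gb(7_000_000_000)
print(f"7B parameters: about {estimate:.0f} GB before activations")
```

Under these assumptions a 7B-parameter model already exceeds a single 80 GB accelerator, which is why full fine-tuning usually means multi-GPU setups or memory-saving techniques.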

Full fine-tuning is best suited for production systems where quality is the top priority and sufficient compute resources and high-quality data are available.

PEFT (Parameter-Efficient Fine-Tuning) refers to a family of methods that freeze most of the model’s parameters and train only a small set of additional parameters.

PEFT emerged because full fine-tuning became impractical as models grew larger. Research has shown that PEFT methods can reach quality comparable to full fine-tuning on many tasks while training only a small fraction of the parameters, which sharply reduces memory and compute requirements.

The main PEFT methods are LoRA, QLoRA, and Adapter layers.

LoRA (Low-Rank Adaptation) freezes the base model weights and adds small adapter matrices at the attention layers, training only those matrices.

Instead of repainting the entire car (full fine-tuning), LoRA is like adding stickers or decals. The car itself (base model) stays the same; only the add-ons change its appearance or function.

graph LR
    I["Input"] --> B["Base model (frozen)"]
    I --> LA["LoRA adapter matrix A (low-rank)"]
    LA --> LB["LoRA adapter matrix B (low-rank)"]
    B --> ADD["Sum"]
    LB --> ADD
    ADD --> O["Output"]

LoRA approximates the weight update $\Delta W$ as the product of two low-rank matrices, $\Delta W = BA$. If the original weight matrix $W$ is $d \times d$, then $B$ is $d \times r$ and $A$ is $r \times d$ (where $r \ll d$), so the adapted layer computes $Wx + BAx$: the input passes through $A$ first and then $B$. Only these two small matrices are trained.[1]
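To see why the low-rank decomposition saves so much, it helps to count parameters for one weight matrix. The sketch below uses illustrative values ($d = 4096$, $r = 8$) that are assumptions chosen for the example, not figures from the paper.

```python
def lora_param_counts(d: int, r: int) -> tuple[int, int, float]:
    """Compare trainable parameters for one d x d weight matrix.

    Returns (full_update_params, lora_params, lora_fraction).
    """
    full_update = d * d      # training the whole d x d matrix
    lora = 2 * d * r         # one d x r matrix plus one r x d matrix
    return full_update, lora, lora / full_update


# Illustrative sizes only: hidden size 4096, rank 8.
full_update, lora, fraction = lora_param_counts(d=4096, r=8)
print(f"full update: {full_update:,} parameters")
print(f"LoRA update: {lora:,} parameters ({fraction:.2%} of the full update)")
```

The fraction for the whole model depends on which layers receive adapters and on the chosen rank, so the per-matrix ratio here is only a rough guide.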

  • Trainable parameters are roughly 0.1–1% of the total
  • Multiple LoRA adapters can be swapped without changing the base model
  • Adapters can be merged into the base model weights before inference ($W + \Delta W$), so the merged model adds no extra latency
  • Available through the Hugging Face PEFT library[3]
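The merge property in the list above follows from linearity: running the frozen weight and the adapter path separately gives the same result as running a single merged weight. The sketch below checks this on tiny integer matrices. The names `down` and `up` (the first and second low-rank matrices the input passes through) and all values are hypothetical, and the scaling factor that real LoRA implementations apply to the adapter path is omitted for brevity.

```python
Matrix = list[list[int]]
Vector = list[int]


def matvec(m: Matrix, v: Vector) -> Vector:
    """Multiply matrix m by vector v."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Multiply matrix a by matrix b."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def matadd(a: Matrix, b: Matrix) -> Matrix:
    """Add two matrices of the same shape element-wise."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


# Toy sizes: d = 3, r = 1. All values are made up for illustration.
W = [[1, 0, 2], [0, 1, 0], [3, 0, 1]]   # frozen base weight, d x d
down = [[2, 1, 0]]                       # first low-rank matrix, r x d
up = [[1], [0], [2]]                     # second low-rank matrix, d x r
x = [1, 2, 3]

# Unmerged: base path plus adapter path, summed at the output.
base_out = matvec(W, x)
adapter_out = matvec(up, matvec(down, x))
unmerged = [b + a for b, a in zip(base_out, adapter_out)]

# Merged: fold the low-rank product into the base weight once.
W_merged = matadd(W, matmul(up, down))
merged = matvec(W_merged, x)

print(unmerged, merged)
```

Keeping the adapter unmerged costs two extra small matrix products per layer but lets several adapters share one base model; merging removes that cost at the price of a dedicated weight copy per adapter.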

QLoRA combines LoRA with quantization, compressing the base model weights to 4-bit precision (NF4 format). It was proposed as a way to substantially reduce the memory required to fine-tune large models.[2]

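Normally, model weights are stored in float16 (16-bit) or float32 (32-bit). QLoRA stores the frozen base model weights in 4-bit NF4 format and dequantizes them to a 16-bit type on the fly for computation, while the LoRA adapter matrices are kept and trained in 16-bit precision (the paper uses bfloat16).[2]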
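The idea of storing weights as 4-bit codes can be illustrated with a simplified block-wise quantizer. This is a sketch under stated assumptions, not the NF4 scheme itself: it uses 16 evenly spaced levels, whereas NF4 uses a non-uniform set of levels designed for normally distributed weights. The function names and sample weights are hypothetical.

```python
# 16 evenly spaced levels in [-1, 1]. Assumption for illustration only:
# NF4 uses a different, non-uniform set of 16 levels.
LEVELS = [-1 + 2 * i / 15 for i in range(16)]


def quantize_block(block: list[float]) -> tuple[list[int], float]:
    """Map each weight to the index (0..15) of its nearest level.

    The block is first scaled by its largest absolute value so that
    every weight falls in [-1, 1]. The scale is stored with the block.
    """
    scale = max(abs(w) for w in block) or 1.0   # avoid dividing by zero
    codes = [
        min(range(16), key=lambda i: abs(w / scale - LEVELS[i]))
        for w in block
    ]
    return codes, scale


def dequantize_block(codes: list[int], scale: float) -> list[float]:
    """Recover approximate weights from 4-bit codes and the block scale."""
    return [LEVELS[c] * scale for c in codes]


weights = [0.12, -0.40, 0.03, 0.25]   # made-up example block
codes, scale = quantize_block(weights)
restored = dequantize_block(codes, scale)
max_error = max(abs(w, ) if False else abs(w - r) for w, r in zip(weights, restored))
print(codes, [round(r, 3) for r in restored], round(max_error, 4))
```

Each code fits in 4 bits, so the block costs a quarter of its float16 size plus one stored scale; the rounding error per weight is bounded by half the spacing between levels times the scale.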

  • A 7B-parameter-class model may be feasible on a 24 GB-class GPU
  • Achieves quality comparable to LoRA while drastically reducing memory usage
  • Minor accuracy loss from quantization is within acceptable range for most use cases
  • Well suited to individual developers or startups customizing models on the hardware they already have
  • Well suited to running many experiments while keeping cloud compute costs down

Adapter layers insert small fully-connected bottleneck networks into the Transformer layers, training only those networks while the base model parameters remain frozen.

A small module is added inside each Transformer layer (in the original design, after the attention and feed-forward sublayers): a down-projection, a non-linear activation, and an up-projection, wrapped in a residual connection. Only the added Adapter modules are trained; the base model is left intact.

  • The Adapter layer executes at every inference step, adding slight latency
  • An older method than LoRA, but still used in multi-task learning
  • A single base model can serve multiple tasks by swapping task-specific Adapters
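The adapter module described above (down-projection, non-linearity, up-projection) can be sketched in a few lines. The residual connection is taken from the usual bottleneck-adapter design, ReLU stands in for whichever activation is used, and the weights and sizes are made-up illustrative values.

```python
Vector = list[int]
Matrix = list[list[int]]


def matvec(m: Matrix, v: Vector) -> Vector:
    """Multiply matrix m by vector v."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def relu(v: Vector) -> Vector:
    """Element-wise ReLU, standing in for the adapter's non-linearity."""
    return [max(0, a) for a in v]


def adapter_forward(h: Vector, w_down: Matrix, w_up: Matrix) -> Vector:
    """Apply a bottleneck adapter to the hidden state h."""
    z = relu(matvec(w_down, h))      # project down to the bottleneck size
    delta = matvec(w_up, z)          # project back up to the hidden size
    return [a + b for a, b in zip(h, delta)]   # residual connection


# Toy sizes: hidden size 4, bottleneck size 2. Values are hypothetical.
h = [1, -2, 0, 3]
w_down = [[1, 0, 0, 1], [0, 1, 0, 0]]          # 2 x 4
w_up = [[1, 0], [0, 1], [0, 0], [1, 1]]        # 4 x 2

out = adapter_forward(h, w_down, w_up)

# With a zero up-projection the adapter leaves the hidden state unchanged,
# which is why adapters are often initialized close to the identity.
identity_out = adapter_forward(h, w_down, [[0, 0]] * 4)
print(out, identity_out)
```

Unlike a merged LoRA update, this module contains a non-linearity, so it cannot be folded into the base weights and runs as extra computation on every forward pass.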

| Method | Trainable params (%) | GPU memory | Quality vs. full FT | Typical use case |
| --- | --- | --- | --- | --- |
| Full fine-tuning | 100% | Very high (tens of GB+) | Baseline (best) | Production quality, sufficient resources |
| LoRA | 0.1–1% | Medium (adapter only) | 95–99% | Quick experiments, managing multiple adapters |
| QLoRA | 0.1–1% (4-bit base) | Low (consumer GPU) | 90–97% | GPU-constrained environments, personal use |
| Adapter | 1–5% | Low–medium | 90–97% | Multi-task learning, task switching |

graph TD
    A["Start fine-tuning"] --> B{"GPU resources?"}
    B -->|"Consumer GPU only"| C["QLoRA"]
    B -->|"Cloud GPU / mid-range"| D{"Quality vs cost?"}
    D -->|"Fast experiments"| E["LoRA"]
    D -->|"Production quality required"| F{"Data and budget?"}
    F -->|"Sufficient data and budget"| G["Full fine-tuning"]
    F -->|"Cost constraints"| E
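
The flowchart above can also be read as a small function. This is only a sketch of the same branching logic; the function name, argument names, and string values are hypothetical labels chosen for the example.

```python
def choose_method(
    gpu: str,
    priority: str = "fast_experiments",
    ample_data_and_budget: bool = False,
) -> str:
    """Follow the decision flowchart and return a fine-tuning method.

    gpu: "consumer" or "cloud" (cloud or mid-range GPU).
    priority: "fast_experiments" or "production_quality".
    ample_data_and_budget: whether data and budget are sufficient.
    """
    if gpu not in ("consumer", "cloud"):
        raise ValueError(f"unknown gpu option: {gpu!r}")
    if priority not in ("fast_experiments", "production_quality"):
        raise ValueError(f"unknown priority: {priority!r}")

    if gpu == "consumer":
        return "QLoRA"
    if priority == "fast_experiments":
        return "LoRA"
    # Production quality required: full fine-tuning only if resources allow.
    return "Full fine-tuning" if ample_data_and_budget else "LoRA"


print(choose_method("consumer"))
print(choose_method("cloud", "production_quality", ample_data_and_budget=True))
```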

Decision guidelines:

  • Quick experiments → LoRA (easy setup, strong quality)
  • Limited GPU → QLoRA (runs on consumer hardware)
  • Production quality is critical → Full fine-tuning (when data and resources allow)
  • One base model for multiple tasks → Adapter (swap per task)
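
Tools and libraries:

  • Hugging Face PEFT: Unified API for LoRA and other parameter-efficient methods; QLoRA is typically set up by combining its LoRA support with 4-bit model loading through bitsandbytes
  • trl (Transformer Reinforcement Learning): Supports SFT (Supervised Fine-Tuning) and RLHF[4]
  • Unsloth: Speeds up LoRA/QLoRA training (the project reports roughly 2–5× faster) and reduces memory usage
  • LlamaFactory: Tool with a GUI for configuring and running LoRA training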

Q: Is a GPU required for fine-tuning?

A: Full fine-tuning and standard LoRA require a GPU. QLoRA’s 4-bit quantization makes 7B-parameter-class models feasible on 24 GB-class GPUs in some setups.[2] Training on CPU alone is theoretically possible but takes impractically long.

Q: How many training examples do I need?

A: As a rough guide, as few as 100–500 high-quality examples can produce noticeable results for behavior tweaks, while domain adaptation typically requires 1,000–10,000 examples. Quality matters more than quantity: 1,000 well-curated examples generally outperform 10,000 noisy ones.

Q: What happens if I overtrain?

A: The model begins to “memorize” the training data and loses the ability to generalize to new inputs. In practice, it may only handle phrasing that appears in the training data, or quality may drop sharply for patterns not seen during training. Monitor validation loss on held-out data and stop training, or roll back to the best checkpoint, once it starts rising while training loss keeps falling.
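
The advice to watch validation loss is usually implemented as early stopping with a patience window. The sketch below is a minimal, framework-independent version; the function name, the patience value, and the loss numbers are hypothetical.

```python
def early_stopping(val_losses: list[float], patience: int = 2) -> tuple[int, int]:
    """Return (stop_epoch, best_epoch) for a sequence of validation losses.

    Training stops once the loss has failed to improve on the best value
    for `patience` consecutive epochs. If that never happens, the last
    epoch is returned as the stop epoch.
    """
    best_loss = float("inf")
    best_epoch = -1
    epochs_without_improvement = 0

    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss = loss
            best_epoch = epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch, best_epoch

    return len(val_losses) - 1, best_epoch


# Made-up losses: improvement until epoch 2, then a steady rise.
losses = [1.00, 0.80, 0.70, 0.72, 0.75, 0.90]
stop_epoch, best_epoch = early_stopping(losses, patience=2)
print(f"stop at epoch {stop_epoch}, keep checkpoint from epoch {best_epoch}")
```

A patience of more than one epoch avoids stopping on a single noisy evaluation; the checkpoint from `best_epoch` is the one to keep.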

  1. LoRA: Low-Rank Adaptation of Large Language Models — LoRA original paper (Hu et al., 2021)
  2. QLoRA: Efficient Finetuning of Quantized LLMs — QLoRA original paper (Dettmers et al., 2023)
  3. Hugging Face PEFT Documentation — Official PEFT library documentation
  4. Hugging Face trl Documentation — SFT and RLHF training library