Fine-Tuning LLMs in 2026 — LoRA, QLoRA, and DPO Without Losing Your Weekend
Three years ago fine-tuning a language model meant renting eight H100s for a weekend, writing a training loop, and hoping the loss curve didn't explode. Today, you can fine-tune a 7B model on your laptop during a lunch break, or a 70B model on a single GPU for the cost of a nice dinner. The math changed.
This is the shortest honest tour of fine-tuning in 2026 I can write. What the three techniques actually do, when to use which, real VRAM numbers, real costs, and the decision tree I use before starting any fine-tune.
First — Decide Whether To Fine-Tune At All
Most fine-tuning projects should be RAG projects, and most RAG projects should be prompt engineering projects. Use this tree.
Prompt Engineering
Your change is "tell the model what to do." System prompts, few-shot examples, structured output schemas. Always try this first. It's free and fast.
RAG
Your change is "the model needs access to facts it doesn't have." Retrieval-augmented generation pulls relevant documents into context. Use this for evolving knowledge.
Fine-Tuning
Your change is "the model needs to behave differently by default." Tone, output format, domain language, task-specific reasoning. The behavior becomes part of the weights.
Combination
Real systems use all three. Fine-tune for tone and format, RAG for facts, prompts for per-request steering. They're not competing choices.
LoRA — The Technique That Changed Everything
Full fine-tuning updates every weight in the model. For a 70B model, that's 140GB of fp16 parameters, plus another 140GB of gradients, plus fp32 optimizer state and master weights. With Adam, that works out to roughly 16 bytes per parameter, or ~1.1TB of GPU memory (see the table below). That's why fine-tuning used to be a big-company thing.
LoRA (Low-Rank Adaptation) freezes the base model and trains a small pair of matrices per target layer such that their product has the same shape as the original weight matrix. Typically these matrices are rank 8 to 64, meaning you're training 0.1%–2% of the parameters.
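Concretely (this is the standard LoRA formulation, not anything library-specific): for a frozen weight matrix W of shape d × k, LoRA trains B (d × r) and A (r × k), and the layer computes

W_eff = W + (alpha / r) · B · A

Only A and B receive gradients; W never changes. Because the update is a plain additive term, the adapter can later be merged into W for zero-overhead inference, or kept separate and swapped at serving time.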
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B")

config = LoraConfig(
    r=16,              # rank — the "width" of the adapters
    lora_alpha=32,     # scaling, conventionally 2x rank
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, config)
model.print_trainable_parameters()
# trainable params: ~13.6M || all params: ~8.0B || trainable%: ~0.17
The trainable parameter count drops by two to three orders of magnitude. Training runs fit in a fraction of the memory, and the output — the LoRA adapter — is a small file (typically tens to a few hundred megabytes, depending on model size and rank). You can keep the same base model and swap adapters per customer, per task, per tenant. This was the trick almost nobody was using in 2022.
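Saving and reloading that adapter is two calls in peft. A minimal sketch (the directory name is made up):

model.save_pretrained("./lora-adapter")  # writes only the adapter weights, not the multi-GB base

# Later, or on another machine: load the frozen base and attach the adapter.
from peft import PeftModel
base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B")
tuned = PeftModel.from_pretrained(base, "./lora-adapter")
merged = tuned.merge_and_unload()  # optional: fold the adapter into the base for standalone serving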
QLoRA — LoRA With Quantization
LoRA still has to load the full base model into GPU memory at its native precision. For a 70B model in fp16, that's ~140GB before activations. You still need a multi-GPU node.
QLoRA loads the base model in 4-bit NormalFloat (NF4) quantization — about 35GB for the same 70B model — while keeping the LoRA adapters in higher precision for stable training. The full model is never materialized back in fp16; each layer dequantizes its weight blocks on the fly inside the matmul.
| Model size | Full fine-tune | LoRA (fp16) | QLoRA (4-bit) |
|---|---|---|---|
| 7B | ~120 GB | ~20 GB | ~8 GB |
| 13B | ~220 GB | ~35 GB | ~14 GB |
| 30B | ~500 GB | ~80 GB | ~32 GB |
| 70B | ~1.1 TB | ~180 GB | ~45 GB |
Numbers are approximate VRAM during training at batch size 1, sequence length 2048. Larger batches and longer sequences push the activation portion up roughly linearly.
The practical impact: a single 24GB consumer card (4090, 3090, 7900 XTX) QLoRAs a 7B model. A single 80GB H100 or MI300X QLoRAs a 70B model. This is roughly one order of magnitude cheaper than full fine-tuning and — in most benchmarks — produces quality within a percent of it.
The standard recipe: bf16 compute dtype, 4-bit NF4 quantization, double quantization enabled, gradient checkpointing on. This covers 90% of practical fine-tunes, and the Hugging Face trl library has it wired into SFTTrainer out of the box.
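That recipe in code, as a sketch (using the bitsandbytes-backed BitsAndBytesConfig in transformers plus peft's helper; the model name is reused from above):

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import prepare_model_for_kbit_training

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # store the frozen base weights in 4-bit
    bnb_4bit_quant_type="nf4",              # NormalFloat4
    bnb_4bit_use_double_quant=True,         # also quantize the quantization constants
    bnb_4bit_compute_dtype=torch.bfloat16,  # dequantize into bf16 for each matmul
)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B",
    quantization_config=bnb_config,
    device_map="auto",
)
model = prepare_model_for_kbit_training(model)  # enables gradient checkpointing by default

From here, attach the LoRA config from the previous section with get_peft_model, or hand it to SFTTrainer as peft_config and let trl do the wrapping.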
DPO — The Death of the RLHF Pipeline
Supervised fine-tuning (SFT) teaches a model to produce outputs. It doesn't teach it to prefer good outputs over bad ones. That's the preference problem, and it used to require a full RLHF pipeline: train a reward model, do proximal policy optimization against it, debug exploding KL, cry, repeat.
Direct Preference Optimization collapses all of that into a single training run. Instead of learning a reward model and then running RL against it, DPO exploits a closed-form equivalence between the KL-constrained reward-maximization objective and the policy itself, which turns preference learning into a simple classification-style loss.
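For reference, the per-pair loss from the DPO paper (pi_theta is the model being trained, pi_ref the frozen reference, beta the same knob as in the config below, sigma the logistic sigmoid):

loss = -log sigma( beta * [ log( pi_theta(chosen | prompt) / pi_ref(chosen | prompt) ) - log( pi_theta(rejected | prompt) / pi_ref(rejected | prompt) ) ] )

Raise the chosen response's likelihood relative to the reference, lower the rejected one's, and you've optimized the same objective RLHF was chasing.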
The practical effect: you need pairs of (chosen, rejected) responses to the same prompt. You run a training job with a specific loss. You get a fine-tuned model. The intermediate reward model is gone, the PPO loop is gone, the instability is mostly gone.
from trl import DPOTrainer, DPOConfig
from datasets import Dataset

# Each row: {prompt, chosen, rejected}
dataset = Dataset.from_list([
    {"prompt": "Explain mutexes.",
     "chosen": "A mutex is a mutual-exclusion lock...",
     "rejected": "Mutex are like locks for threads..."},
    # ... thousands more
])

trainer = DPOTrainer(
    model=model,
    ref_model=None,  # with a LoRA model, trl uses the base weights (adapters disabled) as the reference
    args=DPOConfig(output_dir="./dpo-out", beta=0.1, num_train_epochs=1),
    train_dataset=dataset,
    processing_class=tokenizer,  # named tokenizer= in older trl releases
)
trainer.train()
The typical modern pipeline is SFT → DPO, both done with QLoRA adapters: supervised data teaches the model the task, preference data then teaches it to prefer the right kind of output. Variants (IPO, KTO, ORPO, SimPO) trade off different properties, but the structure is the same.
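The SFT half, sketched with trl's SFTTrainer (sft_dataset is a hypothetical dataset of prompt/response rows; model and config are the quantized model and LoraConfig from earlier):

from trl import SFTTrainer, SFTConfig

sft_trainer = SFTTrainer(
    model=model,                # the 4-bit base model from the QLoRA section
    args=SFTConfig(output_dir="./sft-out", num_train_epochs=1),
    train_dataset=sft_dataset,  # hypothetical: chat-formatted prompt/response examples
    peft_config=config,         # SFTTrainer applies the LoRA wrapping itself
)
sft_trainer.train()

The adapter it produces is what you then hand to the DPOTrainer above as model.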
Real Cost Math (2026 Prices)
Spot H100 from a neo-cloud (Lambda, CoreWeave, RunPod, Voltage Park): $1.80–$2.50 per hour. On-demand H100: $2.80–$4.00. An MI300X on RunPod is often cheaper per GB of memory and now has mature bf16 support.
| Fine-tune | Hardware | Wall time | Cost (spot) |
|---|---|---|---|
| QLoRA 7B, 50k samples, 1 epoch | 1× H100 80GB | ~2–3 hours | $4–$8 |
| QLoRA 13B, 50k samples, 1 epoch | 1× H100 80GB | ~5–7 hours | $10–$18 |
| QLoRA 70B, 50k samples, 1 epoch | 1× H100 80GB | ~24–36 hours | $45–$90 |
| Full FT 7B, 50k samples | 4× H100 80GB | ~4–6 hours | $30–$60 |
| DPO 7B on QLoRA adapter | 1× H100 80GB | ~1–2 hours | $2–$5 |
Rule of thumb: if your dataset is under 100k examples and the model is ≤13B, a good QLoRA fine-tune costs less than dinner for two.
Pitfalls That Burn Weekends
Chat templates
Each model family has its own chat template (Llama-3, Qwen, Mistral, Gemma, etc.). If your training data is formatted with a different template than the base model expects, you will train that mismatch straight into the weights and degrade the model. Always call tokenizer.apply_chat_template when preparing training data, and always decode and read a sample before launching a run; a minimal check is below.
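The check itself, using the standard tokenizer API (the messages are illustrative):

messages = [
    {"role": "user", "content": "Explain mutexes."},
    {"role": "assistant", "content": "A mutex is a mutual-exclusion lock..."},
]
# Render exactly the string the model will be trained on, then read it.
text = tokenizer.apply_chat_template(messages, tokenize=False)
print(text)  # role markers and special tokens should match the base model's template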
Catastrophic forgetting isn't a LoRA problem
…until you set the rank too high or scale alpha too aggressively. Rank 8–32 rarely damages general capability. Rank 128+ with a large scaling alpha can turn your helpful model into a narrow specialist that forgot how to be useful. Start small.
The data is 90% of the outcome
Nothing in this article matters if your 50k training examples are low quality. 5k curated examples consistently beat 50k scraped ones. Spend a day cleaning before you spend an hour training.
Evaluation needs to be there before training starts
If you can't measure "better," you can't tell if your fine-tune worked. Write your eval set before you write the training config. 100–500 held-out examples with clear grading criteria is enough.
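A sketch of the shape this takes; every name here is hypothetical, and grade() stands in for whatever your task's real criterion is (exact match, a regex, an LLM judge):

import json

def grade(expected: str, actual: str) -> bool:
    # Hypothetical criterion: swap in your task's real check.
    return expected.strip().lower() in actual.strip().lower()

def run_eval(generate, eval_path: str = "eval.jsonl") -> float:
    # Score a generate(prompt) -> str callable against the held-out set.
    examples = [json.loads(line) for line in open(eval_path)]
    passed = sum(grade(ex["expected"], generate(ex["prompt"])) for ex in examples)
    return passed / len(examples)

Run it on the base model before training starts; the delta after fine-tuning is the number that matters.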
The 2026 Fine-Tuning Stack
- Framework: Hugging Face trl (SFTTrainer, DPOTrainer) + peft (LoRA/QLoRA configs). Axolotl for YAML-driven recipes. Unsloth for significant speed and memory wins on single-GPU runs.
- Infra: a single rented H100 for most teams. If you're routinely running multi-node jobs, switch to a managed training service (Together, Modal, Replicate, AWS SageMaker HyperPod).
- Tracking: Weights & Biases or MLflow. Loss curves are a first-class diagnostic; you want them visible from the start of the run, not pulled out of logs three hours later.
- Serving: vLLM or SGLang with adapter hot-swapping. You host one copy of the base model and load different LoRA adapters per request or per tenant (see the sketch after this list).
- Eval: a custom held-out set that matches your task, plus a general benchmark (MT-Bench, AlpacaEval, or similar) to catch regressions in general capability.
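Per-request adapter routing with vLLM's offline API, as a sketch (the adapter name and path are made up; SGLang has an equivalent mechanism):

from vllm import LLM, SamplingParams
from vllm.lora.request import LoRARequest

# One copy of the base model in memory; adapters attach per request.
llm = LLM(model="meta-llama/Llama-3.1-8B", enable_lora=True)

outputs = llm.generate(
    ["Summarize this ticket: ..."],
    SamplingParams(max_tokens=256),
    # Hypothetical tenant-specific adapter trained with the recipe above.
    lora_request=LoRARequest("tenant-a-adapter", 1, "/adapters/tenant-a"),
)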
Takeaways
- Fine-tune for behavior, RAG for facts, prompts for per-request steering. Try all three before reaching for training hardware.
- QLoRA is the default. Full fine-tuning is reserved for research and for the rare case where QLoRA genuinely underperforms.
- SFT then DPO. That's the whole pipeline for most teams.
- Budget is measured in dinners, not quarters. Quality bar is measured in eval sets, not vibes.
- Data quality and eval rigor do more for your results than any choice of optimizer.