Fine-Tuning LLMs in 2026 — LoRA, QLoRA, and DPO Without Losing Your Weekend
Three years ago fine-tuning a language model meant renting eight H100s for a weekend, writing a training loop, and hoping the loss curve didn't explode. Today, you can fine-tune a 7B model on your laptop during a lunch break, or a 70B model on a single GPU for the cost of a nice dinner. The math changed.
This is the shortest honest tour of fine-tuning in 2026 I can write. What the three techniques actually do, when to use which, real VRAM numbers, real costs, and the decision tree I use before starting any fine-tune.
First — Decide Whether To Fine-Tune At All
Most fine-tuning projects should be RAG projects, and most RAG projects should be prompt engineering projects. Use this tree.
Prompt Engineering
Your change is "tell the model what to do." System prompts, few-shot examples, structured output schemas. Always try this first. It's free and fast.
RAG
Your change is "the model needs access to facts it doesn't have." Retrieval-augmented generation pulls relevant documents into context. Use this for evolving knowledge.
Fine-Tuning
Your change is "the model needs to behave differently by default." Tone, output format, domain language, task-specific reasoning. The behavior becomes part of the weights.
Combination
Real systems use all three. Fine-tune for tone and format, RAG for facts, prompts for per-request steering. They're not competing choices.
LoRA — The Technique That Changed Everything
Full fine-tuning updates every weight in the model. For a 70B model, that's 140GB of fp16 parameters, plus another 140GB of gradients, plus fp32 optimizer state and master weights. With Adam, that works out to roughly 16 bytes per parameter, or ~1.1TB of GPU memory (see the table below). That's why fine-tuning used to be a big-company thing.
LoRA (Low-Rank Adaptation) freezes the base model and trains a small pair of matrices per target layer such that their product has the same shape as the original weight matrix. Typically these matrices are rank 8 to 64, meaning you're training 0.1%–2% of the parameters.
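Concretely (this is the standard LoRA formulation, not anything library-specific): for a frozen weight matrix W of shape d × k, LoRA trains B (d × r) and A (r × k), and the layer computes

W_eff = W + (alpha / r) · B · A

Only A and B receive gradients; W never changes. Because the update is a plain additive term, the adapter can later be merged into W for zero-overhead inference, or kept separate and swapped at serving time.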
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B")

config = LoraConfig(
    r=16,              # rank — the "width" of the adapters
    lora_alpha=32,     # scaling, conventionally 2x rank
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, config)
model.print_trainable_parameters()
# trainable params: ~13.6M || all params: ~8.0B || trainable%: ~0.17
The trainable parameter count drops by two to three orders of magnitude. Training runs fit in a fraction of the memory, and the output — the LoRA adapter — is a small file (typically tens to a few hundred megabytes, depending on model size and rank). You can keep the same base model and swap adapters per customer, per task, per tenant. This was the trick almost nobody was using in 2022.
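Saving and reloading that adapter is two calls in peft. A minimal sketch (the directory name is made up):

model.save_pretrained("./lora-adapter")  # writes only the adapter weights, not the multi-GB base

# Later, or on another machine: load the frozen base and attach the adapter.
from peft import PeftModel
base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B")
tuned = PeftModel.from_pretrained(base, "./lora-adapter")
merged = tuned.merge_and_unload()  # optional: fold the adapter into the base for standalone serving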
QLoRA — LoRA With Quantization
LoRA still has to load the full base model into GPU memory at its native precision. For a 70B model in fp16, that's ~140GB before activations. You still need a multi-GPU node.
QLoRA loads the base model in 4-bit NormalFloat (NF4) quantization — about 35GB for the same 70B model — while keeping the LoRA adapters in higher precision for stable training. The full model is never materialized back in fp16; each layer dequantizes its weight blocks on the fly inside the matmul.
| Model size | Full fine-tune | LoRA (fp16) | QLoRA (4-bit) |
|---|---|---|---|
| 7B | ~120 GB | ~20 GB | ~8 GB |
| 13B | ~220 GB | ~35 GB | ~14 GB |
| 30B | ~500 GB | ~80 GB | ~32 GB |
| 70B | ~1.1 TB | ~180 GB | ~45 GB |
Numbers are approximate VRAM during training at batch size 1, sequence length 2048. Larger batches and longer sequences push the activation portion up roughly linearly.
The practical impact: a single 24GB consumer card (4090, 3090, 7900 XTX) QLoRAs a 7B model. A single 80GB H100 or MI300X QLoRAs a 70B model. This is roughly one order of magnitude cheaper than full fine-tuning and — in most benchmarks — produces quality within a percent of it.
The standard recipe: bf16 compute dtype, 4-bit NF4 quantization, double quantization enabled, gradient checkpointing on. This covers 90% of practical fine-tunes, and the Hugging Face trl library has it wired into SFTTrainer out of the box.
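That recipe in code, as a sketch (using the bitsandbytes-backed BitsAndBytesConfig in transformers plus peft's helper; the model name is reused from above):

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import prepare_model_for_kbit_training

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # store the frozen base weights in 4-bit
    bnb_4bit_quant_type="nf4",              # NormalFloat4
    bnb_4bit_use_double_quant=True,         # also quantize the quantization constants
    bnb_4bit_compute_dtype=torch.bfloat16,  # dequantize into bf16 for each matmul
)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B",
    quantization_config=bnb_config,
    device_map="auto",
)
model = prepare_model_for_kbit_training(model)  # enables gradient checkpointing by default

From here, attach the LoRA config from the previous section with get_peft_model, or hand it to SFTTrainer as peft_config and let trl do the wrapping.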
DPO — The Death of the RLHF Pipeline
Supervised fine-tuning (SFT) teaches a model to produce outputs. It doesn't teach it to prefer good outputs over bad ones. That's the preference problem, and it used to require a full RLHF pipeline: train a reward model, do proximal policy optimization against it, debug exploding KL, cry, repeat.
Direct Preference Optimization collapses all of that into a single training run. Instead of learning a reward model and then running RL against it, DPO exploits a closed-form equivalence between the KL-constrained reward-maximization objective and the policy itself, which turns preference learning into a simple classification-style loss.
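For reference, the per-pair loss from the DPO paper (pi_theta is the model being trained, pi_ref the frozen reference, beta the same knob as in the config below, sigma the logistic sigmoid):

loss = -log sigma( beta * [ log( pi_theta(chosen | prompt) / pi_ref(chosen | prompt) ) - log( pi_theta(rejected | prompt) / pi_ref(rejected | prompt) ) ] )

Raise the chosen response's likelihood relative to the reference, lower the rejected one's, and you've optimized the same objective RLHF was chasing.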
The practical effect: you need pairs of (chosen, rejected) responses to the same prompt. You run a training job with a specific loss. You get a fine-tuned model. The intermediate reward model is gone, the PPO loop is gone, the instability is mostly gone.
from trl import DPOTrainer, DPOConfig
from datasets import Dataset

# Each row: {prompt, chosen, rejected}
dataset = Dataset.from_list([
    {"prompt": "Explain mutexes.",
     "chosen": "A mutex is a mutual-exclusion lock...",
     "rejected": "Mutex are like locks for threads..."},
    # ... thousands more
])

trainer = DPOTrainer(
    model=model,
    ref_model=None,  # with a LoRA model, trl uses the base weights (adapters disabled) as the reference
    args=DPOConfig(output_dir="./dpo-out", beta=0.1, num_train_epochs=1),
    train_dataset=dataset,
    processing_class=tokenizer,  # named tokenizer= in older trl releases
)
trainer.train()
The typical modern pipeline is SFT → DPO, both done with QLoRA adapters: supervised data teaches the model the task, preference data then teaches it to prefer the right kind of output. Variants (IPO, KTO, ORPO, SimPO) trade off different properties, but the structure is the same.
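The SFT half, sketched with trl's SFTTrainer (sft_dataset is a hypothetical dataset of prompt/response rows; model and config are the quantized model and LoraConfig from earlier):

from trl import SFTTrainer, SFTConfig

sft_trainer = SFTTrainer(
    model=model,                # the 4-bit base model from the QLoRA section
    args=SFTConfig(output_dir="./sft-out", num_train_epochs=1),
    train_dataset=sft_dataset,  # hypothetical: chat-formatted prompt/response examples
    peft_config=config,         # SFTTrainer applies the LoRA wrapping itself
)
sft_trainer.train()

The adapter it produces is what you then hand to the DPOTrainer above as model.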
Real Cost Math (2026 Prices)
Spot H100 from a neo-cloud (Lambda, CoreWeave, RunPod, Voltage Park): $1.80–$2.50 per hour. On-demand H100: $2.80–$4.00. An MI300X on RunPod is often cheaper per GB of memory and now has mature bf16 support.
| Fine-tune | Hardware | Wall time | Cost (spot) |
|---|---|---|---|
| QLoRA 7B, 50k samples, 1 epoch | 1× H100 80GB | ~2–3 hours | $4–$8 |
| QLoRA 13B, 50k samples, 1 epoch | 1× H100 80GB | ~5–7 hours | $10–$18 |
| QLoRA 70B, 50k samples, 1 epoch | 1× H100 80GB | ~24–36 hours | $45–$90 |
| Full FT 7B, 50k samples | 4× H100 80GB | ~4–6 hours | $30–$60 |
| DPO 7B on QLoRA adapter | 1× H100 80GB | ~1–2 hours | $2–$5 |
Rule of thumb: if your dataset is under 100k examples and the model is ≤13B, a good QLoRA fine-tune costs less than dinner for two.
Pitfalls That Burn Weekends
Chat templates
Each model family has its own chat template (Llama-3, Qwen, Mistral, Gemma, etc.). If your training data is formatted with a different template than the base model expects, you will train that mismatch straight into the weights and degrade the model. Always call tokenizer.apply_chat_template when preparing training data, and always decode and read a sample before launching a run; a minimal check is below.
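The check itself, using the standard tokenizer API (the messages are illustrative):

messages = [
    {"role": "user", "content": "Explain mutexes."},
    {"role": "assistant", "content": "A mutex is a mutual-exclusion lock..."},
]
# Render exactly the string the model will be trained on, then read it.
text = tokenizer.apply_chat_template(messages, tokenize=False)
print(text)  # role markers and special tokens should match the base model's template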
Catastrophic forgetting isn't a LoRA problem
…until you set the rank too high or scale alpha too aggressively. Rank 8–32 rarely damages general capability. Rank 128+ with a large scaling alpha can turn your helpful model into a narrow specialist that forgot how to be useful. Start small.
The data is 90% of the outcome
Nothing in this article matters if your 50k training examples are low quality. 5k curated examples consistently beat 50k scraped ones. Spend a day cleaning before you spend an hour training.
Evaluation needs to be there before training starts
If you can't measure "better," you can't tell if your fine-tune worked. Write your eval set before you write the training config. 100–500 held-out examples with clear grading criteria is enough.
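A sketch of the shape this takes; every name here is hypothetical, and grade() stands in for whatever your task's real criterion is (exact match, a regex, an LLM judge):

import json

def grade(expected: str, actual: str) -> bool:
    # Hypothetical criterion: swap in your task's real check.
    return expected.strip().lower() in actual.strip().lower()

def run_eval(generate, eval_path: str = "eval.jsonl") -> float:
    # Score a generate(prompt) -> str callable against the held-out set.
    examples = [json.loads(line) for line in open(eval_path)]
    passed = sum(grade(ex["expected"], generate(ex["prompt"])) for ex in examples)
    return passed / len(examples)

Run it on the base model before training starts; the delta after fine-tuning is the number that matters.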
The 2026 Fine-Tuning Stack
- Framework: Hugging Face trl (SFTTrainer, DPOTrainer) + peft (LoRA/QLoRA configs). Axolotl for YAML-driven recipes. Unsloth for significant speed and memory wins on single-GPU runs.
- Infra: a single rented H100 for most teams. If you're routinely running multi-node jobs, switch to a managed training service (Together, Modal, Replicate, AWS SageMaker HyperPod).
- Tracking: Weights & Biases or MLflow. Loss curves are a first-class diagnostic; you want them visible from the start of the run, not pulled out of logs three hours later.
- Serving: vLLM or SGLang with adapter hot-swapping. You host one copy of the base model and load different LoRA adapters per request or per tenant (see the sketch after this list).
- Eval: a custom held-out set that matches your task, plus a general benchmark (MT-Bench, AlpacaEval, or similar) to catch regressions in general capability.
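Per-request adapter routing with vLLM's offline API, as a sketch (the adapter name and path are made up; SGLang has an equivalent mechanism):

from vllm import LLM, SamplingParams
from vllm.lora.request import LoRARequest

# One copy of the base model in memory; adapters attach per request.
llm = LLM(model="meta-llama/Llama-3.1-8B", enable_lora=True)

outputs = llm.generate(
    ["Summarize this ticket: ..."],
    SamplingParams(max_tokens=256),
    # Hypothetical tenant-specific adapter trained with the recipe above.
    lora_request=LoRARequest("tenant-a-adapter", 1, "/adapters/tenant-a"),
)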
Takeaways
- Fine-tune for behavior, RAG for facts, prompts for per-request steering. Try all three before reaching for training hardware.
- QLoRA is the default. Full fine-tuning is reserved for research and for the rare case where QLoRA genuinely underperforms.
- SFT then DPO. That's the whole pipeline for most teams.
- Budget is measured in dinners, not quarters. Quality bar is measured in eval sets, not vibes.
- Data quality and eval rigor do more for your results than any choice of optimizer.