
Run AI Models Locally: The Complete Guide


Cloud AI APIs are convenient - until you get the bill, hit a rate limit at 2 AM, or realize your proprietary data is sitting on someone else's servers. Running AI models locally gives you full control: no per-token costs, complete privacy, no internet dependency, and the freedom to customize models however you want. This guide covers everything you need to go from zero to running production-grade local AI.

Why Run AI Locally?

The case for local AI has never been stronger. Open-source models like Llama 3.1, Mistral, and DeepSeek now rival GPT-4 class performance for many tasks. Here's why running them on your own hardware makes sense:

Privacy & Data Sovereignty

When you run models locally, your data never leaves your machine. No terms of service changes, no training on your inputs, no compliance headaches. For healthcare, legal, finance, and government use cases, this isn't optional - it's a requirement. Local inference means HIPAA, GDPR, and SOC 2 compliance becomes dramatically simpler because the data boundary is your own hardware.

Cost: $0/Month After Hardware

API costs scale linearly with usage. Local inference has a one-time hardware cost and then runs at the price of electricity. For teams doing heavy inference - RAG pipelines, code generation, batch processing - the payback period is measured in weeks, not years.
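The payback claim is easy to sanity-check: divide the hardware cost by the monthly API spend it replaces. A rough sketch - the helper, its 4.33-weeks-per-month constant, and the default electricity figure are illustrative, with prices taken from the table below:

```python
# Back-of-envelope payback period; all figures are illustrative.
WEEKS_PER_MONTH = 4.33

def payback_weeks(hardware_cost, monthly_tokens_m, api_price_per_m,
                  electricity_per_month=8.0):
    """Weeks until the hardware pays for itself versus API spend."""
    monthly_savings = monthly_tokens_m * api_price_per_m - electricity_per_month
    if monthly_savings <= 0:
        return float("inf")  # at light usage the API stays cheaper
    return hardware_cost / monthly_savings * WEEKS_PER_MONTH

# Heavy production load (200M tokens/month at ~$7.50/M) on a ~$1,500 build:
print(f"{payback_weeks(1500, 200, 7.50, electricity_per_month=30):.1f} weeks")
```

At production volume the build pays for itself in about a month; at hobby volume the function returns infinity, which is the "when APIs still make sense" case below.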

| Usage Level | Monthly Tokens | GPT-4o API Cost | Claude 3.5 API Cost | Local Cost (after hardware) |
|---|---|---|---|---|
| Light (hobby) | 1M tokens | ~$7.50 | ~$9.00 | $0 (electricity ~$2) |
| Moderate (dev team) | 20M tokens | ~$150 | ~$180 | $0 (electricity ~$8) |
| Heavy (production) | 200M tokens | ~$1,500 | ~$1,800 | $0 (electricity ~$30) |
| Enterprise (batch) | 2B tokens | ~$15,000 | ~$18,000 | $0 (electricity ~$90) |

API costs based on 2026 pricing. Local electricity estimates assume a single RTX 4090 at $0.12/kWh running under moderate load.

Latency & Offline Capability

Local inference eliminates network round-trips. For interactive applications - code completion, real-time chat, document processing - you get responses in milliseconds instead of waiting for API calls. And it works on an airplane, in a bunker, or anywhere without internet.

No Rate Limits, No Downtime

API providers have rate limits, usage caps, and outages. Your local setup runs as fast as your hardware allows, 24/7, with no throttling. Need to process 10,000 documents overnight? No one's going to cut you off at request #500.

Full Customization

Fine-tune models on your domain data. Adjust system prompts, temperature, and sampling parameters without restrictions. Create custom Modelfiles. Merge models with techniques like DARE or TIES. Run experimental quantizations. You own the entire stack.

When APIs still make sense: If you need frontier-level reasoning (GPT-4o, Claude 3.5 Sonnet), only process a few hundred requests per day, or don't want to manage hardware, APIs remain the pragmatic choice. Local AI shines when you need volume, privacy, or customization.

Hardware Guide

The single most important factor for local AI performance is VRAM (GPU memory). Models must fit in VRAM for fast inference. If they spill to system RAM, you'll see 5-20x slowdowns. Here's what you need:

VRAM Requirements by Model Size

| Model Size | FP16 (full) | Q8_0 (8-bit) | Q4_K_M (4-bit) | Recommended GPU |
|---|---|---|---|---|
| 1-3B (Phi-4 Mini, Gemma 2 2B) | 2-6 GB | 1.5-3.5 GB | 1-2 GB | Any GPU / CPU-only viable |
| 7-8B (Llama 3.1 8B, Mistral 7B) | 14-16 GB | 8-9 GB | 4.5-5.5 GB | RTX 4060 8GB (Q4) / RTX 4070 Ti 16GB |
| 13-14B (CodeLlama 13B) | 26-28 GB | 14-15 GB | 8-9 GB | RTX 4090 24GB (Q4-Q5) |
| 34B (CodeLlama 34B, Yi 34B) | 68 GB | 36 GB | 20-22 GB | RTX 4090 24GB (tight Q4) / RTX 5090 32GB |
| 70B (Llama 3.1 70B) | 140 GB | 74 GB | 40-42 GB | 2Ɨ RTX 4090 / 2Ɨ RTX 5090 / 1Ɨ A100 80GB |
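Rows like these can be sanity-checked with a rule of thumb: weights take roughly parameters Ɨ bits-per-weight / 8 bytes. A minimal estimator - the ~20% overhead factor for KV cache and runtime is a rough assumption, so results land slightly above weights-only figures:

```python
def vram_gb(params_b, bits_per_weight, overhead=1.2):
    """Rough VRAM need: weight bytes (params x bits/8) plus ~20%
    headroom for KV cache, activations, and runtime overhead."""
    weight_bytes = params_b * 1e9 * bits_per_weight / 8
    return weight_bytes * overhead / 1e9

# Llama 3.1 8B at Q4_K_M (~4.8 bits/weight) vs. FP16 (16 bits/weight):
print(f"Q4_K_M: {vram_gb(8, 4.8):.1f} GB")
print(f"FP16:   {vram_gb(8, 16):.1f} GB")
```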

CPU Inference

CPU inference is viable for models up to ~7B parameters at Q4 quantization. Tools like llama.cpp are heavily optimized for CPU with AVX2/AVX-512 instructions. Expect 5-15 tokens/second on a modern desktop CPU (Ryzen 7/9, Intel 13th/14th gen) for a 7B Q4 model. That's usable for chat but too slow for batch processing. CPU inference is a great starting point if you don't have a GPU - you can always add one later.

Apple Silicon: The Unified Memory Advantage

Apple's M1-M4 chips share memory between CPU and GPU, which means a MacBook Pro with 36GB unified memory can run a 34B Q4 model that would require a dedicated GPU on a PC. The M4 Max with 128GB unified memory can run 70B models comfortably. Performance is roughly 60-70% of an equivalent NVIDIA GPU, but the convenience and power efficiency are unmatched for development use.

  • M1/M2 (8-16GB): 7B models at Q4, good for experimentation
  • M3/M4 Pro (18-36GB): 13B-34B models, solid development machine
  • M4 Max (64-128GB): 70B models, near-workstation performance

NVIDIA vs AMD

NVIDIA dominates local AI due to CUDA ecosystem maturity. Every major tool (Ollama, llama.cpp, vLLM, PyTorch) has first-class CUDA support. AMD's ROCm has improved significantly in 2025-2026, and llama.cpp now runs well on RX 7900 XTX, but you'll still hit compatibility issues with some tools. If you're buying new hardware specifically for AI, go NVIDIA. If you already have a high-end AMD card, it's worth trying - just expect some extra setup.

Recommended Builds

Budget Build (~$500)

  • GPU: Used RTX 3060 12GB (~$180) or RTX 4060 8GB (~$300)
  • CPU: Ryzen 5 5600 or Intel i5-12400
  • RAM: 32GB DDR4
  • Storage: 1TB NVMe SSD (models are 4-40GB each)
  • Runs: 7B models at Q4-Q8, 13B at Q4 (tight on 8GB cards)

Mid-Range Build (~$1,500)

  • GPU: RTX 4090 24GB (~$1,100) - the sweet spot for local AI
  • CPU: Ryzen 7 7700X or Intel i7-14700K
  • RAM: 64GB DDR5
  • Storage: 2TB NVMe SSD
  • Runs: Up to 34B at Q4, 13B at Q8, 7B at FP16. Fast inference (~80-120 tok/s on 7B Q4)

High-End Build (~$3,000+)

  • GPU: RTX 5090 32GB (~$2,000) or 2Ɨ RTX 4090 24GB
  • CPU: Ryzen 9 9900X or Intel i9-14900K
  • RAM: 128GB DDR5
  • Storage: 4TB NVMe SSD
  • PSU: 1200W+ (essential for dual GPU)
  • Runs: 70B at Q4 (dual GPU), 34B at Q8, multiple models simultaneously

Dual GPU tip: llama.cpp and vLLM both support splitting models across multiple GPUs. Two RTX 4090s (48GB total) can run Llama 3.1 70B at Q4 with room for context. Make sure your motherboard has two x16 PCIe slots and your PSU can handle the load (1200W+ for dual 4090s).

Ollama

Ollama is the easiest way to run LLMs locally. One command to install, one command to run a model. It handles model downloading, quantization selection, GPU detection, and serves an OpenAI-compatible API - all out of the box.

Installation

# macOS / Linux - one-liner install
curl -fsSL https://ollama.com/install.sh | sh

# Windows - download from https://ollama.com/download
# Or via winget:
winget install Ollama.Ollama

# Verify installation
ollama --version

Ollama automatically detects your GPU (NVIDIA CUDA, AMD ROCm, Apple Metal) and uses it for inference. No manual driver configuration needed in most cases.

Model Management

# Pull models (downloads to ~/.ollama/models)
ollama pull llama3.1           # Default 8B, Q4_K_M quantization (~4.7GB)
ollama pull llama3.1:70b       # 70B parameter model (~40GB)
ollama pull llama3.1:8b-instruct-q8_0   # 8B at higher-quality 8-bit quantization
ollama pull mistral             # Mistral 7B
ollama pull deepseek-coder-v2   # DeepSeek Coder V2
ollama pull phi4                # Microsoft Phi-4 (small but capable)

# List downloaded models
ollama list

# Show model details (parameters, quantization, size)
ollama show llama3.1

# Remove a model
ollama rm mistral

# Copy/rename a model
ollama cp llama3.1 my-llama

Running Models

# Interactive chat
ollama run llama3.1

# Single prompt (non-interactive)
ollama run llama3.1 "Explain quantum computing in 3 sentences"

# Pipe input
cat article.txt | ollama run llama3.1 "Summarize this:"

# Set a system prompt or parameters from inside the interactive REPL
ollama run llama3.1
>>> /set system "You are a senior Python developer. Be concise."
>>> /set parameter num_ctx 8192

API Usage

Ollama serves a REST API on localhost:11434 by default. It's compatible with the OpenAI API format, so most tools that work with OpenAI also work with Ollama.

# Generate a completion
curl http://localhost:11434/api/generate -d '{
  "model": "llama3.1",
  "prompt": "Write a Python function to merge two sorted lists",
  "stream": false
}'

# Chat completion (OpenAI-compatible)
curl http://localhost:11434/v1/chat/completions -d '{
  "model": "llama3.1",
  "messages": [
    {"role": "system", "content": "You are a helpful coding assistant."},
    {"role": "user", "content": "How do I read a CSV file in Python?"}
  ]
}'

# Generate embeddings
curl http://localhost:11434/api/embeddings -d '{
  "model": "llama3.1",
  "prompt": "The quick brown fox jumps over the lazy dog"
}'

Python Integration

# pip install ollama
import ollama

# Simple generation
response = ollama.generate(model='llama3.1', prompt='Explain Docker in one paragraph')
print(response['response'])

# Chat with message history
messages = [
    {'role': 'system', 'content': 'You are a database expert.'},
    {'role': 'user', 'content': 'When should I use PostgreSQL vs MongoDB?'},
]
response = ollama.chat(model='llama3.1', messages=messages)
print(response['message']['content'])

# Streaming response
for chunk in ollama.chat(model='llama3.1', messages=messages, stream=True):
    print(chunk['message']['content'], end='', flush=True)

# Using the OpenAI-compatible client
from openai import OpenAI

client = OpenAI(base_url='http://localhost:11434/v1', api_key='ollama')
response = client.chat.completions.create(
    model='llama3.1',
    messages=[{'role': 'user', 'content': 'Hello!'}]
)
print(response.choices[0].message.content)

Custom Modelfile

Modelfiles let you create custom model configurations with specific system prompts, parameters, and adapters:

# Save as Modelfile
FROM llama3.1

# Set the system prompt
SYSTEM """You are a senior full-stack developer specializing in TypeScript,
React, and Node.js. You write clean, well-tested code. You explain your
reasoning briefly before providing code. Always include error handling."""

# Tune parameters
PARAMETER temperature 0.3
PARAMETER top_p 0.9
PARAMETER num_ctx 8192
PARAMETER repeat_penalty 1.1
PARAMETER stop "<|eot_id|>"

# Create and run the custom model
ollama create coding-assistant -f Modelfile
ollama run coding-assistant "Write a REST API endpoint for user registration"

GPU Acceleration & Multi-Model Serving

Ollama automatically loads models onto your GPU. You can control GPU layer offloading and run multiple models concurrently:

# Offload layers to the GPU per request via the API "options" field
# (num_gpu = number of layers to offload; 0 = CPU only)
curl http://localhost:11434/api/generate -d '{
  "model": "llama3.1",
  "prompt": "Hello",
  "options": {"num_gpu": 99}
}'

# Keep up to two models loaded in memory at once
OLLAMA_MAX_LOADED_MODELS=2 ollama serve

# Serve up to 4 requests per model in parallel
OLLAMA_NUM_PARALLEL=4 ollama serve

# Run multiple models - Ollama keeps recently used models in memory
# In terminal 1:
curl http://localhost:11434/api/generate -d '{"model": "llama3.1", "prompt": "Hello"}'
# In terminal 2 (loads second model if VRAM allows):
curl http://localhost:11434/api/generate -d '{"model": "mistral", "prompt": "Hello"}'

# Set how long models stay loaded (default: 5 minutes)
OLLAMA_KEEP_ALIVE=30m ollama serve

Ollama is the recommended starting point for anyone new to local AI. It abstracts away the complexity of model formats, quantization, and GPU configuration. Once you outgrow it - needing custom quantizations, production-grade serving, or multi-GPU setups - move to llama.cpp or vLLM.

llama.cpp

llama.cpp is the engine under Ollama's hood - a high-performance C/C++ implementation of LLM inference. It supports GGUF model format, extensive quantization options, GPU offloading via CUDA/Metal/Vulkan, and a built-in HTTP server. When you need maximum control over inference, this is the tool.

Building from Source

# Clone the repository
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp

# Build with CUDA support (NVIDIA GPUs)
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j$(nproc)

# Build with Metal support (Apple Silicon)
cmake -B build -DGGML_METAL=ON
cmake --build build --config Release -j$(sysctl -n hw.ncpu)

# Build with Vulkan support (AMD GPUs, cross-platform)
cmake -B build -DGGML_VULKAN=ON
cmake --build build --config Release -j$(nproc)

# CPU-only build (no GPU acceleration)
cmake -B build
cmake --build build --config Release -j$(nproc)

GGUF Format & Quantization Levels

GGUF is the standard model format for llama.cpp. Models are available pre-quantized on Hugging Face, or you can quantize them yourself. Quantization reduces model size and VRAM usage at the cost of some quality:

| Quantization | Bits/Weight | Size (7B model) | Quality | Speed | Use Case |
|---|---|---|---|---|---|
| F16 | 16 | ~14 GB | ā˜…ā˜…ā˜…ā˜…ā˜… | Slowest | Reference quality, research |
| Q8_0 | 8 | ~7.5 GB | ā˜…ā˜…ā˜…ā˜…ā˜… | Fast | Near-lossless, recommended if VRAM allows |
| Q6_K | 6.5 | ~5.5 GB | ā˜…ā˜…ā˜…ā˜…ā˜† | Fast | High quality with good compression |
| Q5_K_M | 5.5 | ~5.0 GB | ā˜…ā˜…ā˜…ā˜…ā˜† | Faster | Good balance of quality and size |
| Q4_K_M | 4.8 | ~4.4 GB | ā˜…ā˜…ā˜…ā˜†ā˜† | Fastest | Best bang for buck - recommended default |
| Q3_K_M | 3.9 | ~3.5 GB | ā˜…ā˜…ā˜…ā˜†ā˜† | Fastest | When VRAM is very tight |
| Q2_K | 2.6 | ~2.8 GB | ā˜…ā˜…ā˜†ā˜†ā˜† | Fastest | Significant quality loss - last resort |
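The table's advice folds into a small decision helper: take the highest-quality quantization whose weights fit your VRAM, leaving headroom for context. A sketch using the bits-per-weight figures above - the 2 GB reserve is an arbitrary assumption:

```python
# Quantization options from best quality to smallest, with bits/weight.
QUANTS = [("Q8_0", 8.0), ("Q6_K", 6.5), ("Q5_K_M", 5.5),
          ("Q4_K_M", 4.8), ("Q3_K_M", 3.9), ("Q2_K", 2.6)]

def best_quant(params_b, vram_gb, reserve_gb=2.0):
    """Highest-quality quantization whose weights fit in VRAM,
    keeping reserve_gb free for KV cache and context."""
    budget = (vram_gb - reserve_gb) * 1e9
    for name, bits in QUANTS:
        if params_b * 1e9 * bits / 8 <= budget:
            return name
    return None  # model too large even at Q2_K

print(best_quant(8, 24))    # 8B on an RTX 4090: spare room for Q8_0
print(best_quant(70, 48))   # 70B across two 4090s: Q4_K_M fits
print(best_quant(70, 24))   # 70B on a single 24GB card: doesn't fit
```

Note this matches the hardware guide's pairings: dual 4090s land on 70B at Q4, while a single 24GB card cannot host 70B at any quantization.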

Running the Server

# Download a GGUF model from Hugging Face
# Example: https://huggingface.co/bartowski/Meta-Llama-3.1-8B-Instruct-GGUF

# Start the server with full GPU offloading
./build/bin/llama-server \
  -m models/llama-3.1-8b-q4_k_m.gguf \
  -ngl 99 \
  -c 8192 \
  --host 0.0.0.0 \
  --port 8080

# Key flags:
#   -m        Model path
#   -ngl 99   Offload all layers to GPU (set lower for partial offload)
#   -c 8192   Context window size (tokens)
#   -t 8      Number of CPU threads (for CPU layers)
#   --host    Bind address
#   --port    Port number

Performance Tuning

# Optimized server launch for RTX 4090
#   -fa              enable Flash Attention
#   --mlock          lock model in RAM (prevent swapping)
#   -cb              enable continuous batching
#   --parallel 4     handle 4 concurrent requests
#   -b 2048          batch size for prompt processing
#   --threads-http   HTTP server threads
./build/bin/llama-server \
  -m models/llama-3.1-8b-q4_k_m.gguf \
  -ngl 99 -c 8192 -fa --mlock -cb \
  --parallel 4 -b 2048 --threads-http 4

# For CPU inference, maximize thread usage:
./build/bin/llama-server \
  -m models/phi-4-q4_k_m.gguf \
  -ngl 0 \
  -c 4096 \
  -t $(nproc) \
  --mlock

Quantizing Your Own Models

# Convert a Hugging Face model to GGUF
python convert_hf_to_gguf.py /path/to/model --outfile model-f16.gguf --outtype f16

# Quantize to Q4_K_M
./build/bin/llama-quantize model-f16.gguf model-q4_k_m.gguf Q4_K_M

# Quantize to Q8_0
./build/bin/llama-quantize model-f16.gguf model-q8_0.gguf Q8_0

The llama-server exposes an OpenAI-compatible API at /v1/chat/completions, so you can use it as a drop-in replacement for OpenAI in any application.

vLLM

vLLM is a production-grade LLM serving engine designed for high throughput. If Ollama is for development and llama.cpp is for control, vLLM is for serving models at scale. It's the engine behind many commercial LLM APIs.

Key Features

  • PagedAttention: vLLM's signature innovation. It manages KV cache memory like an OS manages virtual memory - using paging to eliminate memory waste. This allows 2-4x more concurrent requests compared to naive implementations.
  • Continuous Batching: New requests are added to running batches dynamically, maximizing GPU utilization. No waiting for a batch to complete before starting new requests.
  • Tensor Parallelism: Split a single model across multiple GPUs for models that don't fit in one GPU's VRAM.
  • OpenAI-Compatible API: Drop-in replacement for the OpenAI API, including streaming, function calling, and tool use.
  • Speculative Decoding: Use a small draft model to predict tokens, then verify with the large model - can improve throughput by 2-3x.

Installation & Basic Usage

# Install vLLM (requires NVIDIA GPU with CUDA)
pip install vllm

# Start the server
vllm serve meta-llama/Llama-3.1-8B-Instruct \
  --host 0.0.0.0 \
  --port 8000 \
  --max-model-len 8192 \
  --gpu-memory-utilization 0.9

# Query it (OpenAI-compatible)
curl http://localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d '{
  "model": "meta-llama/Llama-3.1-8B-Instruct",
  "messages": [{"role": "user", "content": "Explain PagedAttention"}],
  "max_tokens": 512,
  "temperature": 0.7
}'

Multi-GPU with Tensor Parallelism

# Run Llama 3.1 70B across 2 GPUs
vllm serve meta-llama/Llama-3.1-70B-Instruct \
  --tensor-parallel-size 2 \
  --max-model-len 8192 \
  --gpu-memory-utilization 0.95 \
  --host 0.0.0.0 \
  --port 8000

# Run across 4 GPUs for maximum throughput
vllm serve meta-llama/Llama-3.1-70B-Instruct \
  --tensor-parallel-size 4 \
  --max-model-len 16384

Docker Deployment

# Production Docker deployment
docker run -d \
  --name vllm-server \
  --gpus all \
  --shm-size=16g \
  -p 8000:8000 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  vllm/vllm-openai:latest \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --max-model-len 8192 \
  --gpu-memory-utilization 0.9 \
  --enable-auto-tool-choice \
  --tool-call-parser hermes

Performance Benchmarks

| Setup | Model | Throughput (tok/s) | Latency (first token) | Concurrent Users |
|---|---|---|---|---|
| 1Ɨ RTX 4090 | Llama 3.1 8B | ~2,800 | ~45ms | 32+ |
| 1Ɨ RTX 4090 | Mistral 7B | ~3,200 | ~40ms | 32+ |
| 2Ɨ RTX 4090 | Llama 3.1 70B (Q4) | ~1,100 | ~120ms | 16+ |
| 1Ɨ A100 80GB | Llama 3.1 70B | ~1,800 | ~80ms | 24+ |
| 1Ɨ RTX 5090 | Llama 3.1 8B | ~3,500 | ~35ms | 48+ |

Throughput measured with continuous batching at maximum concurrency. Single-user generation speed is typically 80-150 tok/s for 8B models on RTX 4090.
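Aggregate throughput translates directly into serving capacity: responses completed per minute is throughput Ɨ 60 over the average response length. A quick sketch using the RTX 4090 figure from the table (the 400-token response length is an assumption):

```python
def responses_per_minute(throughput_tok_s, avg_response_tokens):
    """Completed responses per minute at a given aggregate
    generation throughput (continuous batching assumed)."""
    return throughput_tok_s * 60 / avg_response_tokens

# RTX 4090 serving Llama 3.1 8B (~2,800 tok/s aggregate), 400-token answers:
print(f"{responses_per_minute(2800, 400):.0f} responses/minute")
```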

When to use vLLM over Ollama: Use vLLM when you need to serve multiple concurrent users, want maximum throughput, need tensor parallelism for large models, or are building a production API. Use Ollama for development, experimentation, and single-user scenarios.

Other Tools

The local AI ecosystem is rich with options. Here are the other major players worth knowing about:

LocalAI

A drop-in replacement for the OpenAI API that runs entirely locally. LocalAI supports LLMs, image generation (Stable Diffusion), audio transcription (Whisper), text-to-speech, and embeddings - all through a single OpenAI-compatible API. It's ideal if you want to replace OpenAI across your entire stack without changing any client code.

# Run with Docker
docker run -p 8080:8080 --gpus all localai/localai:latest-gpu-nvidia-cuda-12

# It's OpenAI-compatible - just change the base URL
from openai import OpenAI
client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")
response = client.chat.completions.create(
    model="llama3.1",
    messages=[{"role": "user", "content": "Hello!"}]
)

LM Studio

A polished desktop GUI for running local models. Download models from Hugging Face with one click, chat with them, and serve them via a local API. LM Studio is the best option for people who want a visual interface without touching the command line. Available on macOS, Windows, and Linux.

GPT4All

An open-source desktop chatbot by Nomic AI. It focuses on ease of use and runs on CPU by default, making it accessible on any hardware. GPT4All includes a local document Q&A feature - drop in PDFs, text files, or code and chat with them using local embeddings and retrieval.

text-generation-webui (oobabooga)

A feature-rich web UI for running LLMs locally. It supports multiple backends (llama.cpp, ExLlamaV2, Transformers, AutoGPTQ), has extensions for character chat, multimodal models, and API serving. It's the Swiss Army knife of local AI interfaces - powerful but with a steeper learning curve.

Jan.ai

An open-source ChatGPT alternative that runs 100% offline. Clean, modern UI with model management, conversation history, and extension support. Jan focuses on being a polished end-user product rather than a developer tool.

Tool Comparison

| Tool | Interface | Best For | GPU Required | API Server | Difficulty |
|---|---|---|---|---|---|
| Ollama | CLI + API | Development, quick setup | Optional | āœ… OpenAI-compatible | Easy |
| llama.cpp | CLI + API | Maximum control, custom builds | Optional | āœ… OpenAI-compatible | Medium |
| vLLM | API | Production serving, throughput | Yes (NVIDIA) | āœ… OpenAI-compatible | Medium |
| LocalAI | API | Full OpenAI replacement | Optional | āœ… OpenAI-compatible | Medium |
| LM Studio | Desktop GUI | Non-technical users, exploration | Optional | āœ… Built-in | Easy |
| GPT4All | Desktop GUI | CPU inference, document Q&A | No | āœ… Basic | Easy |
| oobabooga | Web UI | Power users, multiple backends | Optional | āœ… Extension | Hard |
| Jan.ai | Desktop GUI | ChatGPT replacement, end users | Optional | āœ… Built-in | Easy |

Model Selection Guide

Choosing the right model is as important as choosing the right hardware. Here's a practical guide based on real-world testing across different use cases.

Best Models by Category

General Chat & Instruction Following

  • Llama 3.1 8B Instruct - The default recommendation. Excellent instruction following, strong reasoning, runs on any modern GPU. The 8B size hits the sweet spot of quality vs. resource usage.
  • Llama 3.1 70B Instruct - When you need GPT-4 class quality locally. Requires 40+ GB VRAM (dual GPU or quantized). Worth it for complex reasoning, analysis, and long-form content.
  • Mistral 7B Instruct - Slightly faster than Llama 3.1 8B with comparable quality for most tasks. Excellent at following structured output formats (JSON, XML).

Code Generation

  • DeepSeek Coder V2 (236B MoE / 16B Lite) - The best open-source coding model. The Lite version runs on a single GPU and outperforms GPT-4 on many coding benchmarks. Supports 338 programming languages.
  • CodeLlama 34B - Strong at code completion, infilling, and instruction-based code generation. The 34B variant is particularly good at understanding large codebases.
  • Qwen 2.5 Coder 7B - Excellent coding performance in a small package. Great for code completion in editors.

Creative Writing

  • Mistral Large (123B) - Exceptional at creative, nuanced writing. Requires significant VRAM but produces the most human-like prose among open models.
  • Llama 3.1 70B - Strong creative writing with good instruction following. Better at maintaining consistency over long outputs than smaller models.

Function Calling & Tool Use

  • Hermes 2 Pro (Llama 3 8B) - Purpose-built for function calling. Reliably generates structured tool-use JSON. Works great with agent frameworks.
  • Functionary v3 - Trained specifically on function calling data. Supports parallel function calls and complex tool chains.

Small & Fast (Edge / Mobile / CPU)

  • Phi-4 (14B) - Microsoft's small model that punches way above its weight. Excellent reasoning for its size. Runs well on CPU.
  • Phi-4 Mini (3.8B) - Surprisingly capable for under 4B parameters. Good for classification, extraction, and simple Q&A.
  • Gemma 2 2B - Google's tiny model. Fast enough for real-time applications on CPU. Good for embeddings and simple tasks.

Decision Flowchart

How to pick your model:

  1. What's your VRAM? This determines your maximum model size. Check the VRAM table above.
  2. What's your primary task? Coding → DeepSeek Coder. General → Llama 3.1. Function calling → Hermes. Small/fast → Phi-4.
  3. Quality vs. speed? Use the largest model that fits your VRAM at Q4_K_M. If you need faster responses, drop to a smaller model rather than lower quantization.
  4. Start with Q4_K_M quantization. Only go to Q8 if you have spare VRAM and notice quality issues. Only go to Q3 or Q2 if the model won't fit otherwise.
  5. Test with your actual use case. Benchmarks don't tell the whole story. Run your specific prompts and evaluate the outputs.

Model Size vs. Quality Rule of Thumb

A larger model at lower quantization almost always beats a smaller model at higher quantization. For example, Llama 3.1 70B at Q4 significantly outperforms Llama 3.1 8B at Q8 on reasoning tasks. Always maximize parameter count first, then optimize quantization.

Performance Optimization

Once you've picked your model and tool, these optimizations can squeeze 20-50% more performance out of your hardware.

Flash Attention

Flash Attention rewrites the attention computation to be memory-efficient and cache-friendly. Instead of materializing the full attention matrix (which grows quadratically with context length), it computes attention in tiles that fit in GPU SRAM. This reduces memory usage from O(n²) to O(n) and speeds up inference by 1.5-2x for long contexts.

# Enable in llama.cpp
./llama-server -m model.gguf -ngl 99 -fa

# vLLM uses FlashAttention automatically when the hardware supports it;
# to force the backend explicitly, set an environment variable:
VLLM_ATTENTION_BACKEND=FLASH_ATTN vllm serve model-name

# Check if your GPU supports it (requires compute capability 8.0+)
# RTX 3090, 4090, 5090, A100, H100 all support Flash Attention
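The quadratic blow-up is easy to put in numbers: a naively materialized FP16 score matrix holds heads Ɨ n Ɨ n entries per layer. A quick calculation, assuming 32 attention heads (typical of 8B-class models):

```python
def attn_matrix_gb(context_len, n_heads=32, bytes_per_elem=2):
    """Size of one layer's attention-score matrix if materialized
    in full: n_heads x context x context elements (FP16 = 2 bytes)."""
    return n_heads * context_len ** 2 * bytes_per_elem / 1e9

for n in (2048, 8192, 32768):
    print(f"{n:>6} tokens: {attn_matrix_gb(n):.2f} GB per layer")
```

At 32K context the full matrix would be tens of gigabytes per layer; Flash Attention never materializes it, streaming SRAM-sized tiles instead.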

KV Cache Quantization

The KV (key-value) cache stores attention states for previously processed tokens. For long contexts, this cache can consume more VRAM than the model itself. Quantizing the KV cache to 8-bit or 4-bit reduces its memory footprint by 2-4x with minimal quality impact.

# llama.cpp: quantize KV cache to 8-bit
./llama-server -m model.gguf -ngl 99 -ctk q8_0 -ctv q8_0

# This allows much longer context windows on the same hardware
# Example: 8B model on RTX 4090
#   FP16 KV cache: max ~16K context
#   Q8 KV cache:   max ~32K context
#   Q4 KV cache:   max ~48K context
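The KV-cache footprint follows directly from the architecture: 2 (keys + values) Ɨ layers Ɨ KV heads Ɨ head dimension Ɨ bytes per element, per token. A sketch using Llama 3.1 8B's published configuration (32 layers, 8 KV heads via grouped-query attention, head dimension 128) - note that GQA models cache far less than full-attention models of the same size, so conservative round estimates can overstate the cost:

```python
def kv_cache_gb(context_len, n_layers, n_kv_heads, head_dim, bytes_per_elem=2):
    """KV cache size: keys + values for every layer and KV head, one
    entry per token. bytes_per_elem: 2 = FP16, 1 = Q8, 0.5 = Q4."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem * context_len / 1e9

# Llama 3.1 8B: 32 layers, 8 KV heads (GQA), head dim 128
for ctx in (8192, 32768, 131072):
    fp16 = kv_cache_gb(ctx, 32, 8, 128)
    q8 = kv_cache_gb(ctx, 32, 8, 128, bytes_per_elem=1)
    print(f"{ctx:>6} tokens: {fp16:5.1f} GB FP16 | {q8:5.1f} GB Q8")
```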

Speculative Decoding

Speculative decoding uses a small "draft" model to predict multiple tokens ahead, then the large "target" model verifies them in a single forward pass. Since verification is parallelizable (unlike generation), this can improve throughput by 2-3x without any quality loss.

# llama.cpp: use a small same-family draft model (the draft's vocabulary
# must match the target's) - e.g. Llama 3.2 1B as draft for Llama 3.1 70B
./build/bin/llama-server \
  -m llama-3.1-70b-q4.gguf \
  -md llama-3.2-1b-q4.gguf \
  --draft-max 8 \
  --draft-min 1 \
  -ngl 99 -ngld 99

# vLLM: speculative decoding
vllm serve meta-llama/Llama-3.1-70B-Instruct \
  --speculative-model meta-llama/Llama-3.1-8B-Instruct \
  --num-speculative-tokens 5
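The expected gain has a clean closed form (from the original speculative-decoding analysis): with per-token acceptance rate α and draft length k, each target-model pass yields (1 āˆ’ α^(k+1)) / (1 āˆ’ α) tokens on average. A sketch - the 60-80% acceptance range is a typical assumption for a well-matched draft:

```python
def tokens_per_target_pass(accept_rate, draft_len):
    """Expected tokens emitted per target-model forward pass:
    a geometric series in the per-token draft acceptance rate."""
    a, k = accept_rate, draft_len
    return (1 - a ** (k + 1)) / (1 - a)

for a in (0.6, 0.7, 0.8):
    print(f"acceptance {a:.0%}: ~{tokens_per_target_pass(a, 5):.1f} tokens/pass")
```

At 60-80% acceptance this lands in the 2-3x range quoted above; a poorly matched draft (low acceptance) can even slow things down once the draft model's own compute is counted.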

Context Length vs. Speed Tradeoff

Longer context windows consume more VRAM (for KV cache) and slow down inference. Each doubling of context length roughly halves your maximum concurrent users and increases per-token latency. Set context length to what you actually need, not the model's maximum.

| Context Length | KV Cache (8B model, FP16) | Relative Speed | Good For |
|---|---|---|---|
| 2,048 | ~0.5 GB | 1.0x (fastest) | Short Q&A, classification |
| 4,096 | ~1.0 GB | 0.95x | Chat, code completion |
| 8,192 | ~2.0 GB | 0.85x | Document analysis, RAG |
| 16,384 | ~4.0 GB | 0.70x | Long documents, multi-turn chat |
| 32,768 | ~8.0 GB | 0.50x | Book-length analysis |
| 131,072 | ~32 GB | 0.20x | Full codebase context (needs KV quant) |

Batch Size Tuning

The batch size (-b in llama.cpp) controls how many tokens are processed in parallel during prompt evaluation (the "prefill" phase). Larger batches use more VRAM but process prompts faster. For single-user scenarios, 512-2048 is optimal. For multi-user serving, match batch size to your expected concurrent prompt lengths.

Hardware + Model Benchmarks

| Hardware | Model | Quantization | Generation (tok/s) | Prompt Eval (tok/s) |
|---|---|---|---|---|
| RTX 4060 8GB | Llama 3.1 8B | Q4_K_M | ~45 | ~800 |
| RTX 4090 24GB | Llama 3.1 8B | Q4_K_M | ~120 | ~3,500 |
| RTX 4090 24GB | Llama 3.1 8B | Q8_0 | ~95 | ~2,800 |
| RTX 4090 24GB | DeepSeek Coder V2 16B | Q4_K_M | ~65 | ~1,800 |
| RTX 5090 32GB | Llama 3.1 8B | Q4_K_M | ~155 | ~4,500 |
| 2Ɨ RTX 4090 | Llama 3.1 70B | Q4_K_M | ~25 | ~600 |
| M4 Max 64GB | Llama 3.1 8B | Q4_K_M | ~75 | ~2,000 |
| M4 Max 128GB | Llama 3.1 70B | Q4_K_M | ~18 | ~350 |
| Ryzen 9 7950X (CPU) | Phi-4 14B | Q4_K_M | ~12 | ~180 |
| Ryzen 9 7950X (CPU) | Llama 3.1 8B | Q4_K_M | ~15 | ~220 |

Benchmarks measured with llama.cpp, single-user, 512-token generation after a 256-token prompt. Your results will vary based on context length, system load, and specific model variant.

Integration Patterns

Running a model locally is step one. Integrating it into your development workflow and applications is where the real value comes in.

LangChain with Local Models

LangChain works seamlessly with local models via the OpenAI-compatible APIs that Ollama, llama.cpp, and vLLM all provide:

# pip install langchain langchain-openai
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser

# Point LangChain at your local Ollama instance
llm = ChatOpenAI(
    base_url="http://localhost:11434/v1",
    api_key="ollama",
    model="llama3.1",
    temperature=0.3,
)

# Build a chain
prompt = ChatPromptTemplate.from_messages([
    ("system", "You are a technical writer. Be concise and accurate."),
    ("user", "{input}"),
])
chain = prompt | llm | StrOutputParser()

# Run it
result = chain.invoke({"input": "Explain Docker networking modes"})
print(result)

# RAG with local embeddings
from langchain_community.embeddings import OllamaEmbeddings
from langchain_community.vectorstores import Chroma
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.document_loaders import DirectoryLoader

# Load and split documents
loader = DirectoryLoader("./docs", glob="**/*.md")
docs = loader.load()
splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
chunks = splitter.split_documents(docs)

# Create vector store with local embeddings
embeddings = OllamaEmbeddings(model="llama3.1")
vectorstore = Chroma.from_documents(chunks, embeddings, persist_directory="./chroma_db")

# Query with retrieval
retriever = vectorstore.as_retriever(search_kwargs={"k": 4})
from langchain.chains import RetrievalQA
qa_chain = RetrievalQA.from_chain_type(llm=llm, retriever=retriever)
answer = qa_chain.invoke("How do I configure the database connection?")
print(answer["result"])

Local Chatbot with Gradio

Build a web-based chat interface for your local model in under 30 lines:

# pip install gradio ollama
import gradio as gr
import ollama

def chat(message, history):
    messages = [{"role": "system", "content": "You are a helpful assistant."}]
    for user_msg, bot_msg in history:
        messages.append({"role": "user", "content": user_msg})
        messages.append({"role": "assistant", "content": bot_msg})
    messages.append({"role": "user", "content": message})

    response = ""
    for chunk in ollama.chat(model="llama3.1", messages=messages, stream=True):
        response += chunk["message"]["content"]
        yield response

demo = gr.ChatInterface(
    fn=chat,
    title="Local AI Chat",
    description="Powered by Llama 3.1 running on your hardware",
    examples=["Explain Kubernetes in simple terms", "Write a Python web scraper"],
)

demo.launch(server_name="0.0.0.0", server_port=7860)

VS Code Integration with Continue.dev

Continue.dev is an open-source AI code assistant for VS Code and JetBrains that works with local models. It provides inline code completion, chat, and edit capabilities - all powered by your local Ollama or llama.cpp instance.

// ~/.continue/config.json
{
  "models": [
    {
      "title": "Llama 3.1 8B (Local)",
      "provider": "ollama",
      "model": "llama3.1",
      "apiBase": "http://localhost:11434"
    },
    {
      "title": "DeepSeek Coder (Local)",
      "provider": "ollama",
      "model": "deepseek-coder-v2",
      "apiBase": "http://localhost:11434"
    }
  ],
  "tabAutocompleteModel": {
    "title": "Qwen Coder 7B",
    "provider": "ollama",
    "model": "qwen2.5-coder:7b",
    "apiBase": "http://localhost:11434"
  },
  "embeddingsProvider": {
    "provider": "ollama",
    "model": "llama3.1",
    "apiBase": "http://localhost:11434"
  }
}

After configuring, you get Copilot-like tab completion, an AI chat panel, and inline edit capabilities - all running on your local hardware with zero API costs and complete privacy.

Local Embeddings with sentence-transformers

For RAG pipelines, semantic search, and clustering, you need embeddings. Running them locally is fast and free:

# pip install sentence-transformers
from sentence_transformers import SentenceTransformer
import numpy as np

# Load a high-quality embedding model (~100MB)
model = SentenceTransformer("all-MiniLM-L6-v2")

# Generate embeddings
documents = [
    "Kubernetes orchestrates containerized applications",
    "Docker packages applications into containers",
    "Terraform manages infrastructure as code",
    "PostgreSQL is a relational database",
]
embeddings = model.encode(documents, normalize_embeddings=True)

# Semantic search
query = "container orchestration"
query_embedding = model.encode([query], normalize_embeddings=True)

# Cosine similarity (dot product of unit-normalized vectors)
similarities = np.dot(embeddings, query_embedding.T).flatten()
ranked = sorted(zip(documents, similarities), key=lambda x: x[1], reverse=True)

for doc, score in ranked:
    print(f"{score:.3f}: {doc}")
# Output:
# 0.782: Kubernetes orchestrates containerized applications
# 0.654: Docker packages applications into containers
# 0.231: Terraform manages infrastructure as code
# 0.118: PostgreSQL is a relational database

# For higher-quality embeddings, use a larger model:
model = SentenceTransformer("BAAI/bge-large-en-v1.5")  # ~1.3GB, top-tier quality

# Or use Ollama's built-in embeddings:
import ollama
response = ollama.embeddings(model="llama3.1", prompt="Your text here")
embedding = response["embedding"]  # 4096-dimensional vector

The local AI stack: Ollama (model serving) + Continue.dev (IDE integration) + LangChain (application framework) + sentence-transformers (embeddings) + Chroma (vector store) gives you a complete, private, zero-cost AI development environment that rivals cloud-based solutions for most tasks.

Getting Started Today

Here's the fastest path from zero to running local AI:

  1. Install Ollama - takes 30 seconds on any platform.
  2. Pull a model: ollama pull llama3.1 - downloads ~4.7GB.
  3. Start chatting: ollama run llama3.1 - you're running AI locally.
  4. Add IDE integration: Install Continue.dev in VS Code, point it at Ollama.
  5. Build something: Use the LangChain or Gradio examples above to build your first local AI application.
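Steps 1-3 can be smoke-tested with a few lines of Python - this assumes Ollama is running on its default localhost:11434 endpoint and that `requests` is installed:

```python
def build_payload(model, prompt):
    """Request body for Ollama's /api/generate endpoint."""
    return {"model": model, "prompt": prompt, "stream": False}

def ask(prompt, model="llama3.1", host="http://localhost:11434"):
    """Send one prompt to a local Ollama server and return the reply."""
    import requests  # pip install requests
    r = requests.post(f"{host}/api/generate",
                      json=build_payload(model, prompt), timeout=120)
    r.raise_for_status()
    return r.json()["response"]

# print(ask("Say hello in exactly five words."))  # requires a running Ollama
```

If the call returns text, your whole local stack - model, GPU offload, and API server - is working.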

The open-source AI ecosystem has reached a tipping point. Models are good enough, tools are mature enough, and hardware is affordable enough that running AI locally is no longer a compromise - it's a competitive advantage. You get privacy, zero marginal cost, full customization, and independence from API providers. The only question is which model to start with.

Want More?

Check out our AI tools comparison for cloud-based alternatives, or dive into our hands-on tutorials for step-by-step project guides.