[Diagram: Docker Compose production architecture with services and networking]

Docker Compose for Production - 2026 Guide


Docker Compose v2 Changes

Docker Compose v2 replaced the Python-based docker-compose binary with a Go plugin integrated directly into the Docker CLI. The command changed from docker-compose up to docker compose up (no hyphen). As of 2026, v1 is fully deprecated and no longer receives security patches.

Key differences that matter for production:

# Check your version - must be v2.x
docker compose version
# Docker Compose version v2.32.4

# Validate without starting
docker compose -f docker-compose.prod.yml config
docker compose -f docker-compose.prod.yml up --dry-run

Compose file version field is obsolete. Drop the version: "3.8" line from your files. Compose v2 ignores it and uses the latest schema automatically. Keeping it triggers a deprecation warning.

Production Patterns

Multi-stage Builds

Multi-stage builds keep production images small by separating build dependencies from runtime. A typical Node.js API image drops from 1.2 GB to under 150 MB.

# Dockerfile - multi-stage production build
FROM node:22-alpine AS builder
WORKDIR /app
COPY package*.json ./
RUN npm ci
COPY . .
RUN npm run build
# Drop dev dependencies so the final stage only receives runtime packages
RUN npm prune --omit=dev

FROM node:22-alpine AS production
RUN addgroup -g 1001 appgroup && adduser -u 1001 -G appgroup -s /bin/sh -D appuser
WORKDIR /app
COPY --from=builder --chown=appuser:appgroup /app/dist ./dist
COPY --from=builder --chown=appuser:appgroup /app/node_modules ./node_modules
COPY --from=builder --chown=appuser:appgroup /app/package.json ./
USER appuser
EXPOSE 3000
HEALTHCHECK --interval=30s --timeout=5s --start-period=10s --retries=3 \
  CMD wget -q --spider http://localhost:3000/health || exit 1
CMD ["node", "dist/server.js"]

Health Checks

Health checks tell Docker whether a container is actually working, not just running. Without them, a container whose application has crashed while PID 1 stays alive remains in the "running" state forever. Compose uses health checks to control startup order via depends_on conditions.

services:
  api:
    build:
      context: .
      target: production
    healthcheck:
      test: ["CMD", "wget", "-q", "--spider", "http://localhost:3000/health"]
      interval: 30s
      timeout: 5s
      start_period: 10s
      retries: 3
    depends_on:
      db:
        condition: service_healthy
      redis:
        condition: service_healthy

  db:
    image: postgres:17-alpine
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U $${POSTGRES_USER} -d $${POSTGRES_DB}"]
      interval: 10s
      timeout: 5s
      retries: 5

  redis:
    image: redis:7-alpine
    healthcheck:
      test: ["CMD", "redis-cli", "ping"]
      interval: 10s
      timeout: 3s
      retries: 5
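
When a check flaps between healthy and unhealthy, the probe's own output usually says why. One way to read it is a small helper like the following (a sketch; the `last_health` name is ours, and it assumes the `docker` CLI and a container with a configured health check):

```shell
# last_health - print exit code and output of recent health probes for a container.
# Sketch; assumes the `docker` CLI is on PATH.
last_health() {
  docker inspect \
    --format '{{range .State.Health.Log}}{{.ExitCode}}: {{.Output}}{{end}}' "$1"
}

# Usage: last_health api
```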

Restart Policies

Production containers must restart automatically after crashes and host reboots. Use unless-stopped for most services and on-failure for one-shot tasks like migrations.

services:
  api:
    restart: unless-stopped   # survives host reboot, stops only on manual docker compose stop

  db:
    restart: unless-stopped

  migrate:
    restart: on-failure       # runs once, retries on failure, stays stopped on success
    command: ["npm", "run", "migrate"]

Resource Limits

Without resource limits, a single runaway container can consume all host memory and crash everything. Set both limits (hard ceiling) and reservations (guaranteed minimum).

services:
  api:
    deploy:
      resources:
        limits:
          cpus: "1.0"
          memory: 512M
        reservations:
          cpus: "0.25"
          memory: 128M

  db:
    deploy:
      resources:
        limits:
          cpus: "2.0"
          memory: 1G
        reservations:
          cpus: "0.5"
          memory: 256M

OOM kills are silent by default. Check docker inspect --format='{{.State.OOMKilled}}' container_name after unexpected restarts. If a container keeps getting OOM-killed, raise its memory limit or fix the leak.
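
To check the whole stack at once rather than one container at a time, a short loop works (a sketch; the `oom_killed` name and default compose file path are our assumptions):

```shell
# oom_killed - print the name of every OOM-killed container in a Compose project.
# Sketch; assumes the `docker` CLI and a compose file at the given path.
oom_killed() {
  compose_file="${1:-docker-compose.prod.yml}"
  for c in $(docker compose -f "$compose_file" ps -q); do
    if [ "$(docker inspect --format '{{.State.OOMKilled}}' "$c")" = "true" ]; then
      docker inspect --format '{{.Name}}' "$c"
    fi
  done
}

# Usage: oom_killed docker-compose.prod.yml
```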

Networking and Service Discovery

Docker Compose creates a default bridge network for each project, but production stacks benefit from explicit custom networks. Custom networks provide isolation between service groups and control which containers can talk to each other.

networks:
  frontend:
    driver: bridge
  backend:
    driver: bridge
    internal: true    # no external internet access
  monitoring:
    driver: bridge

services:
  traefik:
    networks:
      - frontend

  api:
    networks:
      - frontend      # reachable by traefik
      - backend        # can reach db and redis

  db:
    networks:
      - backend        # only reachable by api, NOT by traefik

  redis:
    networks:
      - backend

  prometheus:
    networks:
      - frontend      # scrape traefik metrics
      - backend        # scrape api metrics
      - monitoring

Service discovery works through Docker's built-in DNS. Every service name resolves to its container IP within shared networks. Your API connects to Postgres at db:5432 and Redis at redis:6379 with zero configuration.

# In your API's environment
services:
  api:
    environment:
      DATABASE_URL: "postgresql://app:secret@db:5432/myapp"
      REDIS_URL: "redis://redis:6379/0"

DNS caching gotcha. Docker's embedded DNS stays current, but your application may not: long-lived connection pools and client-side resolver caches can keep pointing at the old IP after a dependent service restarts with a new address. Health checks and connection retry logic in your application handle this. Libraries like pg for Node.js and SQLAlchemy for Python support automatic reconnection.

For multi-host networking, Docker Compose alone is not enough. You need Kubernetes or Docker Swarm with overlay networks. On a single host, bridge networks handle everything.

Secrets Management

Environment variables are the most common way to pass credentials to containers, and the most dangerous. They show up in docker inspect, process listings, crash dumps, and log output. Docker Compose supports file-based secrets that mount as read-only files inside the container.

secrets:
  db_password:
    file: ./secrets/db_password.txt
  api_key:
    file: ./secrets/api_key.txt

services:
  api:
    secrets:
      - api_key
    environment:
      # Read the secret from the mounted file
      API_KEY_FILE: /run/secrets/api_key

  db:
    secrets:
      - db_password
    environment:
      POSTGRES_PASSWORD_FILE: /run/secrets/db_password

The secrets mount at /run/secrets/<name> as read-only files. Many official images (Postgres, MySQL, Redis) support the _FILE suffix convention, reading the credential from a file path instead of a plain environment variable.

For applications that do not support _FILE variables, use an entrypoint script:

#!/bin/sh
# entrypoint.sh - load secrets from files into env vars, then hand off to the real command
set -eu
export DB_PASSWORD="$(cat /run/secrets/db_password)"
export API_KEY="$(cat /run/secrets/api_key)"
exec "$@"

services:
  api:
    entrypoint: ["/app/entrypoint.sh"]
    command: ["node", "dist/server.js"]
    secrets:
      - db_password
      - api_key

Never commit secrets files to git. Add secrets/ to your .gitignore. For CI/CD, inject secrets from your pipeline's secret store (GitHub Actions secrets, AWS Secrets Manager, or HashiCorp Vault) and write them to files before running docker compose up.

For production deployments on AWS, consider pulling secrets at startup from AWS Secrets Manager using an init container or entrypoint script with the AWS CLI.
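
A concrete sketch of that pipeline step (variable names and paths are assumptions; the placeholder defaults exist only so the snippet runs standalone):

```shell
# Write pipeline-injected secrets to the files the Compose file references.
# Sketch; assumes DB_PASSWORD and API_KEY are injected by the CI runner.
DB_PASSWORD="${DB_PASSWORD:-changeme}"
API_KEY="${API_KEY:-changeme}"

umask 077                     # secret files are created mode 0600
mkdir -p secrets
printf '%s' "$DB_PASSWORD" > secrets/db_password.txt
printf '%s' "$API_KEY" > secrets/api_key.txt

# then: docker compose -f docker-compose.prod.yml up -d
```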

Volumes and Persistence

Named volumes persist data across container restarts and recreations. Without them, data lands in the container's writable layer or in anonymous volumes, and a docker compose down -v (or a routine container recreation) destroys your database. Named volumes are the only safe option for production data.

volumes:
  postgres_data:
    driver: local
  redis_data:
    driver: local
  prometheus_data:
    driver: local
  grafana_data:
    driver: local

services:
  db:
    image: postgres:17-alpine
    volumes:
      - postgres_data:/var/lib/postgresql/data
      - ./init.sql:/docker-entrypoint-initdb.d/init.sql:ro

  redis:
    image: redis:7-alpine
    volumes:
      - redis_data:/data
    command: ["redis-server", "--appendonly", "yes"]

Back up named volumes on a schedule. A throwaway Alpine container can tar a volume's contents; for transactionally consistent Postgres backups, prefer pg_dump, since tarring a live data directory can capture a mid-write state:

# Backup a named volume
docker run --rm \
  -v postgres_data:/source:ro \
  -v $(pwd)/backups:/backup \
  alpine tar czf /backup/postgres-$(date +%Y%m%d).tar.gz -C /source .

# Restore a volume (stop the consuming service first: docker compose stop db)
docker run --rm \
  -v postgres_data:/target \
  -v $(pwd)/backups:/backup \
  alpine tar xzf /backup/postgres-20260502.tar.gz -C /target
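
Backups also need rotation so they do not fill the disk. A retention helper, as a sketch (pure shell, no Docker required; the `prune_backups` name is ours):

```shell
# prune_backups - delete all but the newest N postgres-*.tar.gz archives in a dir.
# Sketch; relies on `ls -1t` listing newest files first.
prune_backups() {
  dir="$1"
  keep="${2:-7}"
  ls -1t "$dir"/postgres-*.tar.gz 2>/dev/null | tail -n +"$((keep + 1))" |
    while IFS= read -r old; do
      rm -f -- "$old"
    done
}

# Usage: prune_backups ./backups 7
```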

Monitoring Stack

A production stack without monitoring is flying blind. The standard open-source monitoring stack uses four components: Prometheus collects metrics, Grafana visualizes them, cAdvisor exposes container metrics, and Node Exporter exposes host metrics. All four run as Compose services. For a deeper dive, see our Observability Stack guide.

services:
  prometheus:
    image: prom/prometheus:v3.2.1
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml:ro
      - prometheus_data:/prometheus
    command:
      - "--config.file=/etc/prometheus/prometheus.yml"
      - "--storage.tsdb.retention.time=30d"
      - "--web.enable-lifecycle"
    ports:
      - "9090:9090"
    networks:
      - monitoring
      - backend
    restart: unless-stopped
    deploy:
      resources:
        limits:
          memory: 512M

  grafana:
    image: grafana/grafana:11.5.2
    volumes:
      - grafana_data:/var/lib/grafana
      - ./grafana/provisioning:/etc/grafana/provisioning:ro
    environment:
      GF_SECURITY_ADMIN_PASSWORD_FILE: /run/secrets/grafana_password
      GF_USERS_ALLOW_SIGN_UP: "false"
      GF_SERVER_ROOT_URL: "https://grafana.example.com"
    secrets:
      - grafana_password
    ports:
      - "3001:3000"
    networks:
      - monitoring
    restart: unless-stopped
    depends_on:
      prometheus:
        condition: service_started

  cadvisor:
    image: gcr.io/cadvisor/cadvisor:v0.51.0
    volumes:
      - /:/rootfs:ro
      - /var/run:/var/run:ro
      - /sys:/sys:ro
      - /var/lib/docker/:/var/lib/docker:ro
    ports:
      - "8080:8080"
    networks:
      - monitoring
    restart: unless-stopped
    deploy:
      resources:
        limits:
          memory: 128M

  node-exporter:
    image: prom/node-exporter:v1.9.0
    volumes:
      - /proc:/host/proc:ro
      - /sys:/host/sys:ro
      - /:/rootfs:ro
    command:
      - "--path.procfs=/host/proc"
      - "--path.sysfs=/host/sys"
      - "--path.rootfs=/rootfs"
      - "--collector.filesystem.mount-points-exclude=^/(sys|proc|dev|host|etc)($$|/)"
    ports:
      - "9100:9100"
    networks:
      - monitoring
    restart: unless-stopped

The Prometheus configuration scrapes all four targets:

# prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s

scrape_configs:
  - job_name: "prometheus"
    static_configs:
      - targets: ["localhost:9090"]

  - job_name: "api"
    static_configs:
      - targets: ["api:3000"]
    metrics_path: /metrics

  - job_name: "cadvisor"
    static_configs:
      - targets: ["cadvisor:8080"]

  - job_name: "node-exporter"
    static_configs:
      - targets: ["node-exporter:9100"]

Expose a /metrics endpoint in your API. Use prom-client for Node.js, prometheus_client for Python, or prometheus/client_golang for Go. Track request duration, error rates, and active connections at minimum.

Reverse Proxy with Traefik

Traefik is the best reverse proxy for Docker Compose because it auto-discovers services through Docker labels. No config file updates when you add or remove services. It handles TLS certificates automatically via Let's Encrypt.

services:
  traefik:
    image: traefik:v3.3
    command:
      - "--api.dashboard=true"
      - "--providers.docker=true"
      - "--providers.docker.exposedbydefault=false"
      - "--entrypoints.web.address=:80"
      - "--entrypoints.websecure.address=:443"
      - "--entrypoints.web.http.redirections.entrypoint.to=websecure"
      - "--entrypoints.web.http.redirections.entrypoint.scheme=https"
      - "--certificatesresolvers.letsencrypt.acme.tlschallenge=true"
      - "--certificatesresolvers.letsencrypt.acme.email=admin@example.com"
      - "--certificatesresolvers.letsencrypt.acme.storage=/letsencrypt/acme.json"
      - "--metrics.prometheus=true"
      - "--accesslog=true"
      - "--accesslog.format=json"
    ports:
      - "80:80"
      - "443:443"
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock:ro
      - letsencrypt_data:/letsencrypt
    networks:
      - frontend
    restart: unless-stopped
    deploy:
      resources:
        limits:
          cpus: "0.5"
          memory: 256M
    labels:
      - "traefik.enable=true"
      - "traefik.http.routers.dashboard.rule=Host(`traefik.example.com`)"
      - "traefik.http.routers.dashboard.service=api@internal"
      - "traefik.http.routers.dashboard.tls.certresolver=letsencrypt"
      - "traefik.http.routers.dashboard.middlewares=auth"
      - "traefik.http.middlewares.auth.basicauth.users=admin:$$apr1$$xyz$$hashedpassword"

  api:
    labels:
      - "traefik.enable=true"
      - "traefik.http.routers.api.rule=Host(`api.example.com`)"
      - "traefik.http.routers.api.tls.certresolver=letsencrypt"
      - "traefik.http.routers.api.entrypoints=websecure"
      - "traefik.http.services.api.loadbalancer.server.port=3000"
      - "traefik.http.middlewares.api-ratelimit.ratelimit.average=100"
      - "traefik.http.middlewares.api-ratelimit.ratelimit.burst=50"
      - "traefik.http.routers.api.middlewares=api-ratelimit"

  grafana:
    labels:
      - "traefik.enable=true"
      - "traefik.http.routers.grafana.rule=Host(`grafana.example.com`)"
      - "traefik.http.routers.grafana.tls.certresolver=letsencrypt"
      - "traefik.http.services.grafana.loadbalancer.server.port=3000"

Traefik reads Docker labels at runtime. When you scale a service with docker compose up --scale api=3, Traefik automatically load-balances across all three instances. Certificates renew automatically 30 days before expiry.

Mounting the Docker socket is a security risk. The socket gives Traefik (and anyone who compromises it) full control over Docker. Mitigate this by running Traefik with read-only filesystem, dropping all capabilities, and using a Docker socket proxy like tecnativa/docker-socket-proxy that exposes only the read endpoints Traefik needs.

CI/CD with GitHub Actions

A production Compose deployment needs automated builds, image scanning, and zero-downtime deploys. GitHub Actions handles the full pipeline: build, scan, push to a registry, SSH into the server, pull new images, and restart services.

# .github/workflows/deploy.yml
name: Deploy Production
on:
  push:
    branches: [main]

env:
  REGISTRY: ghcr.io
  IMAGE_NAME: ${{ github.repository }}

jobs:
  build-and-push:
    runs-on: ubuntu-latest
    permissions:
      contents: read
      packages: write
    steps:
      - uses: actions/checkout@v4

      - name: Log in to Container Registry
        uses: docker/login-action@v3
        with:
          registry: ${{ env.REGISTRY }}
          username: ${{ github.actor }}
          password: ${{ secrets.GITHUB_TOKEN }}

      - name: Build and push image
        uses: docker/build-push-action@v6
        with:
          context: .
          push: true
          target: production
          tags: |
            ${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}:${{ github.sha }}
            ${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}:latest
          cache-from: type=gha
          cache-to: type=gha,mode=max

      - name: Scan image with Trivy
        uses: aquasecurity/trivy-action@master  # pin to a release tag for reproducible builds
        with:
          image-ref: "${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}:${{ github.sha }}"
          format: "sarif"
          output: "trivy-results.sarif"
          severity: "CRITICAL,HIGH"
          exit-code: "1"

  deploy:
    needs: build-and-push
    runs-on: ubuntu-latest
    steps:
      - name: Deploy to production server
        uses: appleboy/ssh-action@v1
        with:
          host: ${{ secrets.PROD_HOST }}
          username: ${{ secrets.PROD_USER }}
          key: ${{ secrets.SSH_PRIVATE_KEY }}
          script: |
            cd /opt/myapp
            docker compose -f docker-compose.prod.yml pull api
            docker compose -f docker-compose.prod.yml up -d --no-deps api
            docker image prune -f

The deploy step pulls only the updated image and restarts only the API service (--no-deps), leaving the database, Redis, and monitoring stack untouched. This gives you near-zero-downtime deployments.

For true zero-downtime, use Traefik's health check integration. Traefik waits for the new container to pass its health check before routing traffic to it and draining the old one.

# Add to your api service labels
labels:
  - "traefik.http.services.api.loadbalancer.healthcheck.path=/health"
  - "traefik.http.services.api.loadbalancer.healthcheck.interval=5s"
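
The same guarantee can be enforced from the deploy script itself: recreate the service, then poll Docker's health status before declaring the deploy done. A sketch (the `wait_healthy` helper and its timings are assumptions; it assumes the `docker` CLI and a container with a configured health check):

```shell
# wait_healthy - poll a container until Docker reports its healthcheck as healthy.
# Sketch; returns 0 once healthy, 1 after the retry budget is exhausted.
wait_healthy() {
  name="$1"
  tries="${2:-30}"
  i=0
  while [ "$i" -lt "$tries" ]; do
    status=$(docker inspect --format '{{.State.Health.Status}}' "$name" 2>/dev/null)
    [ "$status" = "healthy" ] && return 0
    i=$((i + 1))
    sleep 2
  done
  return 1
}

# Usage after `docker compose ... up -d --no-deps api`:
#   wait_healthy api 30 || { echo "deploy failed health check"; exit 1; }
```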

Scaling and When to Graduate to K8s

Docker Compose supports horizontal scaling on a single host with the --scale flag or the deploy.replicas key. Combined with Traefik's auto-discovery, this gives you basic load balancing without any extra configuration.

services:
  api:
    deploy:
      replicas: 3
      resources:
        limits:
          cpus: "1.0"
          memory: 512M

# Scale dynamically
docker compose -f docker-compose.prod.yml up -d --scale api=5

# Check running instances
docker compose ps api

Compose scaling has hard limits. All replicas run on the same host, so you are bounded by that machine's CPU and memory. There is no automatic failover if the host goes down.

Capability           | Docker Compose         | Kubernetes
---------------------|------------------------|-----------------------------
Single-host scaling  | Yes (replicas flag)    | Yes
Multi-host scaling   | No                     | Yes (auto)
Auto-scaling on load | No                     | HPA, VPA, KEDA
Self-healing         | Restart policies only  | Full pod rescheduling
Rolling updates      | Manual (pull + up)     | Built-in with rollback
Service mesh         | No                     | Istio, Linkerd, Cilium
Secret rotation      | Manual restart         | Automatic with CSI driver
Complexity           | Low (one YAML file)    | High (many abstractions)
Ops overhead         | Minimal                | Significant (or use managed)

Stay with Compose when: you run on a single server, traffic fits one machine, your team is small, and you value simplicity over features. Many SaaS products serve thousands of users from a single well-provisioned host running Compose.

Graduate to Kubernetes when: you need multi-node high availability, auto-scaling based on CPU or custom metrics, canary deployments, or your team has the bandwidth to manage the added complexity. Managed Kubernetes (EKS, GKE, AKS) reduces the ops burden significantly.

The middle ground exists. Docker Swarm mode uses the same Compose file format with docker stack deploy and supports multi-node clusters. It is simpler than Kubernetes but less feature-rich. Consider it if you need two or three nodes but not the full Kubernetes ecosystem.

Security Hardening

Default Docker containers run with more privileges than they need. Production hardening means reducing the attack surface: read-only filesystems, dropped capabilities, non-root users, and vulnerability scanning.

Read-only Filesystem

A read-only root filesystem prevents attackers from writing malware, modifying binaries, or planting backdoors inside a compromised container. Use tmpfs mounts for directories that need write access.

services:
  api:
    read_only: true
    tmpfs:
      - /tmp
      - /var/run
    volumes:
      - app_logs:/app/logs    # only specific dirs are writable

Drop Capabilities and Prevent Privilege Escalation

Linux capabilities grant fine-grained root powers. Drop all of them and add back only what the container actually needs. The no-new-privileges flag prevents processes inside the container from gaining additional privileges through setuid binaries or capability inheritance.

services:
  api:
    security_opt:
      - no-new-privileges:true
    cap_drop:
      - ALL
    cap_add:
      - NET_BIND_SERVICE    # only if binding to ports below 1024

  db:
    security_opt:
      - no-new-privileges:true
    cap_drop:
      - ALL
    cap_add:
      - CHOWN
      - SETUID
      - SETGID
      - FOWNER
      - DAC_OVERRIDE

Non-root Users

Running as root inside a container means a container escape gives the attacker root on the host. Always specify a non-root user in your Dockerfile or Compose file.

services:
  api:
    user: "1001:1001"    # matches the appuser created in Dockerfile
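
It is worth verifying that the setting took effect on the running container rather than trusting the config. A sketch (the `assert_nonroot` helper is ours; it assumes `docker compose`, a running service, and an `id` binary in the image):

```shell
# assert_nonroot - fail if a compose service's main process runs as root (UID 0).
# Sketch; assumes the `docker compose` CLI and a running service.
assert_nonroot() {
  uid=$(docker compose exec -T "$1" id -u)
  [ "$uid" != "0" ]
}

# Usage: assert_nonroot api && echo "api runs unprivileged"
```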

Vulnerability Scanning with Trivy

Scan every image before it reaches production. Trivy checks for OS package vulnerabilities, language-specific dependency issues, and misconfigurations in Dockerfiles.

# Scan a local image
trivy image myapp:latest

# Scan and fail on critical/high vulnerabilities
trivy image --severity CRITICAL,HIGH --exit-code 1 myapp:latest

# Scan a Dockerfile for misconfigurations
trivy config Dockerfile

# Scan a running Compose stack
for img in $(docker compose images -q); do
  trivy image "$img"
done

# Example Trivy output
myapp:latest (alpine 3.21.3)
Total: 0 (CRITICAL: 0, HIGH: 0)

Node.js (node_modules/package-lock.json)
Total: 1 (HIGH: 1)
+-----------+------------------+----------+-------------------+---------------+
| Library   | Vulnerability    | Severity | Installed Version | Fixed Version |
+-----------+------------------+----------+-------------------+---------------+
| lodash    | CVE-2025-XXXXX   | HIGH     | 4.17.20           | 4.17.22       |
+-----------+------------------+----------+-------------------+---------------+

Security checklist for every production Compose service:
  • Non-root user in Dockerfile and Compose
  • read_only: true with targeted tmpfs mounts
  • no-new-privileges:true in security_opt
  • cap_drop: ALL with minimal cap_add
  • Trivy scan in CI pipeline with exit-code 1 on HIGH/CRITICAL
  • Pin image tags to digests, not :latest
  • Use internal: true networks for backend services
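
For the digest-pinning item, a tag can be resolved to its immutable digest once the image has been pulled. A sketch (the `image_digest` helper is ours; note that `RepoDigests` is empty for images built locally and never pushed):

```shell
# image_digest - print the repo@sha256 digest for a locally pulled image tag.
# Sketch; assumes the `docker` CLI and an image pulled from a registry.
image_digest() {
  docker image inspect --format '{{index .RepoDigests 0}}' "$1"
}

# Usage: image_digest postgres:17-alpine
# Then pin it in the Compose file:
#   image: postgres@sha256:<digest>
```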

Complete Production Compose File

Here is the full production docker-compose.prod.yml combining everything from this guide: API, Postgres, Redis, Traefik with auto-SSL, and the complete monitoring stack. Copy this as your starting point and customize the domain names, image references, and resource limits for your workload.

# docker-compose.prod.yml - Complete production stack
# Usage: docker compose -f docker-compose.prod.yml up -d

secrets:
  db_password:
    file: ./secrets/db_password.txt
  api_key:
    file: ./secrets/api_key.txt
  grafana_password:
    file: ./secrets/grafana_password.txt

networks:
  frontend:
    driver: bridge
  backend:
    driver: bridge
    internal: true
  monitoring:
    driver: bridge

volumes:
  postgres_data:
  redis_data:
  letsencrypt_data:
  prometheus_data:
  grafana_data:

services:
  # ---- Reverse Proxy ----
  traefik:
    image: traefik:v3.3
    container_name: traefik
    command:
      - "--providers.docker=true"
      - "--providers.docker.exposedbydefault=false"
      - "--entrypoints.web.address=:80"
      - "--entrypoints.websecure.address=:443"
      - "--entrypoints.web.http.redirections.entrypoint.to=websecure"
      - "--entrypoints.web.http.redirections.entrypoint.scheme=https"
      - "--certificatesresolvers.letsencrypt.acme.tlschallenge=true"
      - "--certificatesresolvers.letsencrypt.acme.email=admin@example.com"
      - "--certificatesresolvers.letsencrypt.acme.storage=/letsencrypt/acme.json"
      - "--metrics.prometheus=true"
      - "--accesslog=true"
      - "--accesslog.format=json"
    ports:
      - "80:80"
      - "443:443"
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock:ro
      - letsencrypt_data:/letsencrypt
    networks:
      - frontend
    restart: unless-stopped
    read_only: true
    security_opt:
      - no-new-privileges:true
    cap_drop:
      - ALL
    cap_add:
      - NET_BIND_SERVICE
    deploy:
      resources:
        limits:
          cpus: "0.5"
          memory: 256M
        reservations:
          cpus: "0.1"
          memory: 64M
    labels:
      - "traefik.enable=true"
      - "traefik.http.routers.dashboard.rule=Host(`traefik.example.com`)"
      - "traefik.http.routers.dashboard.service=api@internal"
      - "traefik.http.routers.dashboard.tls.certresolver=letsencrypt"

  # ---- Application ----
  api:
    build:
      context: .
      dockerfile: Dockerfile
      target: production
    image: ghcr.io/myorg/myapp:latest
    secrets:
      - db_password
      - api_key
    environment:
      NODE_ENV: production
      DATABASE_URL: "postgresql://app@db:5432/myapp"  # password supplied via DB_PASSWORD_FILE, not inline
      REDIS_URL: "redis://redis:6379/0"
      DB_PASSWORD_FILE: /run/secrets/db_password
      API_KEY_FILE: /run/secrets/api_key
    networks:
      - frontend
      - backend
    restart: unless-stopped
    read_only: true
    tmpfs:
      - /tmp
    security_opt:
      - no-new-privileges:true
    cap_drop:
      - ALL
    user: "1001:1001"
    healthcheck:
      test: ["CMD", "wget", "-q", "--spider", "http://localhost:3000/health"]
      interval: 30s
      timeout: 5s
      start_period: 15s
      retries: 3
    depends_on:
      db:
        condition: service_healthy
      redis:
        condition: service_healthy
    deploy:
      replicas: 2
      resources:
        limits:
          cpus: "1.0"
          memory: 512M
        reservations:
          cpus: "0.25"
          memory: 128M
    labels:
      - "traefik.enable=true"
      - "traefik.http.routers.api.rule=Host(`api.example.com`)"
      - "traefik.http.routers.api.tls.certresolver=letsencrypt"
      - "traefik.http.routers.api.entrypoints=websecure"
      - "traefik.http.services.api.loadbalancer.server.port=3000"
      - "traefik.http.services.api.loadbalancer.healthcheck.path=/health"
      - "traefik.http.services.api.loadbalancer.healthcheck.interval=5s"
      - "traefik.http.middlewares.api-ratelimit.ratelimit.average=100"
      - "traefik.http.middlewares.api-ratelimit.ratelimit.burst=50"
      - "traefik.http.routers.api.middlewares=api-ratelimit"

  # ---- Database ----
  db:
    image: postgres:17-alpine
    container_name: db
    secrets:
      - db_password
    environment:
      POSTGRES_DB: myapp
      POSTGRES_USER: app
      POSTGRES_PASSWORD_FILE: /run/secrets/db_password
    volumes:
      - postgres_data:/var/lib/postgresql/data
      - ./init.sql:/docker-entrypoint-initdb.d/init.sql:ro
    networks:
      - backend
    restart: unless-stopped
    security_opt:
      - no-new-privileges:true
    cap_drop:
      - ALL
    cap_add:
      - CHOWN
      - SETUID
      - SETGID
      - FOWNER
      - DAC_OVERRIDE
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U app -d myapp"]
      interval: 10s
      timeout: 5s
      retries: 5
    deploy:
      resources:
        limits:
          cpus: "2.0"
          memory: 1G
        reservations:
          cpus: "0.5"
          memory: 256M

  # ---- Cache ----
  redis:
    image: redis:7-alpine
    container_name: redis
    command: ["redis-server", "--appendonly", "yes", "--maxmemory", "256mb", "--maxmemory-policy", "allkeys-lru"]
    volumes:
      - redis_data:/data
    networks:
      - backend
    restart: unless-stopped
    read_only: true
    security_opt:
      - no-new-privileges:true
    cap_drop:
      - ALL
    healthcheck:
      test: ["CMD", "redis-cli", "ping"]
      interval: 10s
      timeout: 3s
      retries: 5
    deploy:
      resources:
        limits:
          cpus: "0.5"
          memory: 300M
        reservations:
          cpus: "0.1"
          memory: 64M

  # ---- Monitoring ----
  prometheus:
    image: prom/prometheus:v3.2.1
    container_name: prometheus
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml:ro
      - prometheus_data:/prometheus
    command:
      - "--config.file=/etc/prometheus/prometheus.yml"
      - "--storage.tsdb.retention.time=30d"
      - "--web.enable-lifecycle"
    networks:
      - monitoring
      - backend
      - frontend
    restart: unless-stopped
    security_opt:
      - no-new-privileges:true
    cap_drop:
      - ALL
    deploy:
      resources:
        limits:
          cpus: "1.0"
          memory: 512M
        reservations:
          cpus: "0.25"
          memory: 128M

  grafana:
    image: grafana/grafana:11.5.2
    container_name: grafana
    volumes:
      - grafana_data:/var/lib/grafana
      - ./grafana/provisioning:/etc/grafana/provisioning:ro
    secrets:
      - grafana_password
    environment:
      GF_SECURITY_ADMIN_PASSWORD_FILE: /run/secrets/grafana_password
      GF_USERS_ALLOW_SIGN_UP: "false"
      GF_SERVER_ROOT_URL: "https://grafana.example.com"
    networks:
      - monitoring
      - frontend
    restart: unless-stopped
    security_opt:
      - no-new-privileges:true
    cap_drop:
      - ALL
    depends_on:
      prometheus:
        condition: service_started
    deploy:
      resources:
        limits:
          cpus: "0.5"
          memory: 256M
    labels:
      - "traefik.enable=true"
      - "traefik.http.routers.grafana.rule=Host(`grafana.example.com`)"
      - "traefik.http.routers.grafana.tls.certresolver=letsencrypt"
      - "traefik.http.services.grafana.loadbalancer.server.port=3000"

  cadvisor:
    image: gcr.io/cadvisor/cadvisor:v0.51.0
    container_name: cadvisor
    volumes:
      - /:/rootfs:ro
      - /var/run:/var/run:ro
      - /sys:/sys:ro
      - /var/lib/docker/:/var/lib/docker:ro
    networks:
      - monitoring
    restart: unless-stopped
    security_opt:
      - no-new-privileges:true
    deploy:
      resources:
        limits:
          memory: 128M

  node-exporter:
    image: prom/node-exporter:v1.9.0
    container_name: node-exporter
    volumes:
      - /proc:/host/proc:ro
      - /sys:/host/sys:ro
      - /:/rootfs:ro
    command:
      - "--path.procfs=/host/proc"
      - "--path.sysfs=/host/sys"
      - "--path.rootfs=/rootfs"
      - "--collector.filesystem.mount-points-exclude=^/(sys|proc|dev|host|etc)($$|/)"
    networks:
      - monitoring
    restart: unless-stopped
    security_opt:
      - no-new-privileges:true
    cap_drop:
      - ALL
    deploy:
      resources:
        limits:
          memory: 64M

# Deploy the full stack
docker compose -f docker-compose.prod.yml up -d

# Check all services are healthy
docker compose -f docker-compose.prod.yml ps

# View logs for a specific service
docker compose -f docker-compose.prod.yml logs -f api

# Scale the API
docker compose -f docker-compose.prod.yml up -d --scale api=3

# Update a single service (zero-downtime with Traefik)
docker compose -f docker-compose.prod.yml pull api
docker compose -f docker-compose.prod.yml up -d --no-deps api

# Full stack resource usage
docker stats --format "table {{.Name}}\t{{.CPUPerc}}\t{{.MemUsage}}"