Observability dashboard with metrics, logs, and traces

A well-instrumented system tells you what's wrong before your users do

Last updated: April 2026 - Covers OpenTelemetry SDK stable releases, Grafana 11, Loki 3, Tempo 2, and current vendor pricing.

The Three Pillars

Metrics tell you something is wrong. Logs tell you what went wrong. Traces tell you where it went wrong. You need all three, but you don't need all three on day one.

| Pillar | What It Answers | Tool | Priority |
| --- | --- | --- | --- |
| Metrics | Is the system healthy? How fast? How loaded? | Prometheus / Grafana Mimir | 🔴 Day 1 |
| Logs | What happened? What was the error message? | Loki / Elasticsearch | 🟡 Week 1 |
| Traces | Which service/function is slow? What's the call chain? | Tempo / Jaeger | 🟢 Month 1 |

OpenTelemetry: The Current State

OpenTelemetry is the CNCF standard for instrumentation. In 2026, it's the default choice for new projects:

  • Tracing: Stable (GA) in all major languages since 2023
  • Metrics: Stable (GA) in all major languages since 2024
  • Logs: Stable (GA) in most SDKs as of late 2025
  • Profiling: Experimental - not production-ready yet
  • Collector: Production-hardened, used at massive scale

The key advantage: vendor neutrality. Instrument once with OTel, export to any backend (Grafana, Datadog, New Relic, Honeycomb). Switch vendors without re-instrumenting your code.
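In practice, switching vendors means changing the Collector config, not your code. A minimal sketch of an otel-collector.yml that fans the same telemetry out to two backends (both endpoint URLs are placeholders, not real vendor endpoints):

```yaml
# otel-collector.yml - illustrative sketch; endpoints are placeholders
receivers:
  otlp:
    protocols:
      http:
        endpoint: 0.0.0.0:4318

exporters:
  otlphttp/grafana:
    endpoint: https://otlp-gateway.example.grafana.net/otlp  # placeholder
  otlphttp/other-vendor:
    endpoint: https://otlp.example-vendor.com                # placeholder

service:
  pipelines:
    traces:
      receivers: [otlp]
      exporters: [otlphttp/grafana, otlphttp/other-vendor]
```

Adding, removing, or swapping an exporter here touches nothing in your application.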

Build vs Buy Comparison

| Feature | Datadog | Grafana Cloud | New Relic | Self-Hosted |
| --- | --- | --- | --- | --- |
| Infra Monitoring | $15/host/mo | Free (10K series) | Free (100GB/mo) | $0 (Prometheus) |
| Log Management | $1.70/GB ingested | Free (50GB/mo) | Free (100GB/mo) | $0 (Loki) |
| APM / Traces | $31/host/mo | Free (50GB/mo) | Free (100GB/mo) | $0 (Tempo) |
| Alerting | Included | Included | Included | Alertmanager |
| Dashboards | Excellent | Excellent | Good | Grafana |
| Setup Effort | Low (agent) | Low (agent) | Low (agent) | High (docker-compose) |
| Maintenance | Zero | Zero | Zero | 2-4 hrs/month |
| 10-host estimate | $500-800/mo | $0-50/mo | $0 (within free) | $20-50/mo (VM) |

Budget Tiers

| Budget | Stack | What You Get |
| --- | --- | --- |
| $0/mo | Grafana Cloud Free + OTel | 50GB logs, 50GB traces, 10K metrics series, 14-day retention |
| $100/mo | Self-hosted on a $50 VM + Grafana Cloud Free | Unlimited retention, full control, Grafana Cloud as backup |
| $500/mo | Grafana Cloud Pro | Longer retention, more volume, SLAs, SSO |
| $2K/mo | Datadog or Grafana Cloud Advanced | Full APM, RUM, synthetics, error tracking, ML-based alerts |

Recommendation: Start at $0 with Grafana Cloud Free. It's genuinely generous and covers most startups through Series A. Move to self-hosted or paid tiers only when you hit volume limits.

What to Monitor for SaaS

The Four Golden Signals

  1. Latency - p50, p95, p99 response time per endpoint
  2. Traffic - Requests per second, by endpoint and status code
  3. Errors - 5xx rate, error rate by endpoint, error types
  4. Saturation - CPU, memory, disk, connection pool usage

SaaS Business Metrics

  • Signup rate (per hour/day)
  • Login success/failure rate
  • API usage by tenant (for rate limiting and billing)
  • Stripe webhook processing time and failure rate
  • Background job queue depth and processing time
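Business metrics like these are one OTel instrument away. A sketch using the @opentelemetry/api package (meter, metric, and attribute names here are illustrative, and the SDK must already be initialized for anything to be exported):

```javascript
// business-metrics.js - illustrative names; assumes an OTel SDK is already running
const { metrics } = require("@opentelemetry/api");

const meter = metrics.getMeter("my-saas-api");

// Counter for completed signups, attributable by plan
const signupCounter = meter.createCounter("signups_total", {
  description: "Completed signups",
});

// Histogram for Stripe webhook processing time
const webhookDuration = meter.createHistogram("stripe_webhook_duration_seconds", {
  description: "Stripe webhook processing time",
  unit: "s",
});

// Call from your signup handler
function recordSignup(plan) {
  signupCounter.add(1, { plan });
}

// Wrap your webhook handler to time it, including failures
async function timedWebhook(handler) {
  const start = process.hrtime.bigint();
  try {
    return await handler();
  } finally {
    webhookDuration.record(Number(process.hrtime.bigint() - start) / 1e9);
  }
}
```

These show up in your backend as regular metrics, so the same PromQL and alerting machinery applies to business signals.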

Alerting Strategy

The #1 alerting mistake: Too many alerts. Alert fatigue kills on-call engineers and makes them ignore real incidents. Start with 5 alerts, not 50.

| Alert | Condition | Severity |
| --- | --- | --- |
| High error rate | 5xx rate > 5% for 5 minutes | 🔴 Page |
| High latency | p95 > 2s for 10 minutes | 🟡 Warn |
| Service down | Zero successful requests for 2 minutes | 🔴 Page |
| Disk > 85% | Any volume > 85% full | 🟡 Warn |
| Certificate expiry | SSL cert expires in < 14 days | 🟡 Warn |
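As Prometheus alerting rules, the first two alerts look roughly like this (metric and label names assume the OTel HTTP histogram used elsewhere in this guide; adjust to your label scheme):

```yaml
# alerts.yml - sketch of the first two alerts above
groups:
  - name: core-alerts
    rules:
      - alert: HighErrorRate
        expr: |
          sum(rate(http_server_request_duration_seconds_count{http_status_code=~"5.."}[5m]))
            / sum(rate(http_server_request_duration_seconds_count[5m])) > 0.05
        for: 5m
        labels:
          severity: page
      - alert: HighLatencyP95
        expr: |
          histogram_quantile(0.95,
            sum(rate(http_server_request_duration_seconds_bucket[5m])) by (le)) > 2
        for: 10m
        labels:
          severity: warn
```

The `for:` clause is what keeps transient blips from paging anyone.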

OTel Instrumentation (Express.js)

// tracing.js - Load BEFORE your app code
const { NodeSDK } = require("@opentelemetry/sdk-node");
const { getNodeAutoInstrumentations } = require("@opentelemetry/auto-instrumentations-node");
const { OTLPTraceExporter } = require("@opentelemetry/exporter-trace-otlp-http");
const { OTLPMetricExporter } = require("@opentelemetry/exporter-metrics-otlp-http");
const { PeriodicExportingMetricReader } = require("@opentelemetry/sdk-metrics");
const { Resource } = require("@opentelemetry/resources");
const { ATTR_SERVICE_NAME, ATTR_SERVICE_VERSION }
  = require("@opentelemetry/semantic-conventions");

// OTEL_EXPORTER_OTLP_ENDPOINT is a base URL; the per-signal path is appended here
const otlpBase = process.env.OTEL_EXPORTER_OTLP_ENDPOINT || "http://localhost:4318";

const sdk = new NodeSDK({
  resource: new Resource({
    [ATTR_SERVICE_NAME]: "my-saas-api",
    [ATTR_SERVICE_VERSION]: process.env.APP_VERSION || "0.1.0",
  }),
  traceExporter: new OTLPTraceExporter({ url: `${otlpBase}/v1/traces` }),
  metricReader: new PeriodicExportingMetricReader({
    exporter: new OTLPMetricExporter({ url: `${otlpBase}/v1/metrics` }),
    exportIntervalMillis: 15000,
  }),
  instrumentations: [getNodeAutoInstrumentations({
    // fs instrumentation is noisy and rarely useful
    "@opentelemetry/instrumentation-fs": { enabled: false },
  })],
});

sdk.start();
// Flush pending telemetry before the process exits
process.on("SIGTERM", () => sdk.shutdown().finally(() => process.exit(0)));

Run your app with: node --require ./tracing.js app.js. This auto-instruments HTTP, Express, PostgreSQL, Redis, and gRPC with zero code changes to your app.
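Auto-instrumentation stops at library boundaries; business logic needs a manual span. A sketch using @opentelemetry/api (function and attribute names are illustrative, and the SDK bootstrap above must already be loaded):

```javascript
// Manual span around business logic the auto-instrumentation can't see
const { trace, SpanStatusCode } = require("@opentelemetry/api");

const tracer = trace.getTracer("my-saas-api");

async function processInvoice(invoiceId) {
  // startActiveSpan makes this span the parent of anything created inside
  return tracer.startActiveSpan("process-invoice", async (span) => {
    span.setAttribute("invoice.id", invoiceId);
    try {
      // ... your business logic here ...
    } catch (err) {
      span.recordException(err);
      span.setStatus({ code: SpanStatusCode.ERROR });
      throw err;
    } finally {
      span.end();
    }
  });
}
```

The span nests under the auto-instrumented HTTP span, so in Tempo you see the request broken down to the exact business step that was slow.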

Self-Hosted Stack (Docker Compose)

# docker-compose.yml - Grafana + Prometheus + Loki + Tempo
services:
  prometheus:
    image: prom/prometheus:v2.53.0
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
      - prometheus_data:/prometheus
    ports: ["9090:9090"]
    command: ["--config.file=/etc/prometheus/prometheus.yml",
              "--storage.tsdb.retention.time=30d"]

  loki:
    image: grafana/loki:3.1.0
    volumes:
      - loki_data:/loki
    ports: ["3100:3100"]
    command: ["-config.file=/etc/loki/local-config.yaml"]

  tempo:
    image: grafana/tempo:2.5.0
    volumes:
      - ./tempo.yml:/etc/tempo.yml
      - tempo_data:/tmp/tempo
    ports: ["3200:3200"]  # OTLP (4317/4318) goes through the collector, not Tempo directly
    command: ["-config.file=/etc/tempo.yml"]

  otel-collector:
    image: otel/opentelemetry-collector-contrib:0.102.0
    volumes:
      # the contrib image reads its config from /etc/otelcol-contrib/config.yaml
      - ./otel-collector.yml:/etc/otelcol-contrib/config.yaml
    ports: ["4317:4317", "4318:4318", "8889:8889"]
    depends_on: [prometheus, loki, tempo]

  grafana:
    image: grafana/grafana:11.1.0
    volumes:
      - grafana_data:/var/lib/grafana
    ports: ["3000:3000"]
    environment:
      GF_SECURITY_ADMIN_PASSWORD: changeme
      GF_AUTH_ANONYMOUS_ENABLED: "false"
    depends_on: [prometheus, loki, tempo]

volumes:
  prometheus_data:
  loki_data:
  tempo_data:
  grafana_data:
Resource requirements: This stack runs comfortably on a 2 vCPU / 4GB RAM VM ($20-40/month on any cloud). For production, add persistent storage and backups. For high volume (>100GB/day), split Loki and Tempo onto separate VMs.
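The compose file mounts a ./prometheus.yml that isn't shown above. A minimal version that scrapes the collector's Prometheus exporter (port 8889, as exposed in the compose file) and Prometheus itself might look like this:

```yaml
# prometheus.yml - minimal scrape config for the stack above
global:
  scrape_interval: 15s

scrape_configs:
  - job_name: otel-collector
    static_configs:
      - targets: ["otel-collector:8889"]  # compose service name resolves on the internal network
  - job_name: prometheus
    static_configs:
      - targets: ["localhost:9090"]
```

Service names work as hostnames because Compose puts all services on one network.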

PromQL Dashboard Queries

Four Golden Signals

# Request rate (traffic)
sum(rate(http_server_request_duration_seconds_count[5m])) by (http_route)

# Error rate
sum(rate(http_server_request_duration_seconds_count{http_status_code=~"5.."}[5m]))
/
sum(rate(http_server_request_duration_seconds_count[5m]))

# Latency p95
histogram_quantile(0.95,
  sum(rate(http_server_request_duration_seconds_bucket[5m])) by (le, http_route)
)

# Saturation - CPU usage
1 - avg(rate(node_cpu_seconds_total{mode="idle"}[5m])) by (instance)

SLO Tracking (99.9% availability)

# Error budget remaining (30-day window)
1 - (
  sum(increase(http_server_request_duration_seconds_count{http_status_code=~"5.."}[30d]))
  /
  sum(increase(http_server_request_duration_seconds_count[30d]))
) / 0.001

# Burn rate (how fast you're consuming error budget)
(
  sum(rate(http_server_request_duration_seconds_count{http_status_code=~"5.."}[1h]))
  /
  sum(rate(http_server_request_duration_seconds_count[1h]))
) / 0.001
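A common pattern is to alert on burn rate at multiple windows: page on fast burn, warn on slow burn. A sketch of a fast-burn rule (a sustained 14.4x burn exhausts a 30-day budget in about two days; the threshold comes from the multiwindow approach popularized by the Google SRE workbook):

```yaml
# fast-burn alert sketch; metric names match the queries above
- alert: ErrorBudgetFastBurn
  expr: |
    (
      sum(rate(http_server_request_duration_seconds_count{http_status_code=~"5.."}[1h]))
        / sum(rate(http_server_request_duration_seconds_count[1h]))
    ) / 0.001 > 14.4
  for: 5m
  labels:
    severity: page
```

A matching slow-burn rule would use a longer window (say, 6h) and a lower threshold, routed to a warning channel instead of a page.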

The Bottom Line

Start with Grafana Cloud Free + OpenTelemetry auto-instrumentation. You'll have metrics, logs, and traces in an afternoon with zero ongoing cost. Build 5 alerts (not 50), track the four golden signals, and add business metrics as you grow. Self-host only when you hit free tier limits or need longer retention. The goal is answering "what's broken and why" in under 5 minutes - not building a monitoring empire.