Observability Stack for Startups: OTel, Grafana & Beyond
The three pillars, the tools, the budget tiers, and the exact docker-compose file and PromQL queries to get production observability running this week.
A well-instrumented system tells you what's wrong before your users do
The Three Pillars
Metrics tell you something is wrong. Logs tell you what went wrong. Traces tell you where it went wrong. You need all three, but you don't need all three on day one.
| Pillar | What It Answers | Tool | Priority |
|---|---|---|---|
| Metrics | Is the system healthy? How fast? How loaded? | Prometheus / Grafana Mimir | 🔴 Day 1 |
| Logs | What happened? What was the error message? | Loki / Elasticsearch | 🟡 Week 1 |
| Traces | Which service/function is slow? What's the call chain? | Tempo / Jaeger | 🟢 Month 1 |
OpenTelemetry: The Current State
OpenTelemetry is the CNCF standard for instrumentation. In 2026, it's the default choice for new projects:
- Tracing: Stable (GA) in all major languages since 2023
- Metrics: Stable (GA) in all major languages since 2024
- Logs: Stable (GA) in most SDKs as of late 2025
- Profiling: Experimental - not production-ready yet
- Collector: Production-hardened, used at massive scale
The key advantage: vendor neutrality. Instrument once with OTel, export to any backend (Grafana, Datadog, New Relic, Honeycomb). Switch vendors without re-instrumenting your code.
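In practice, switching backends is usually just an environment change, since the OTLP exporters read their endpoint and headers from standard spec-defined variables (the vendor URL and token below are placeholders):

```shell
# Local collector (the default in the setup below)
export OTEL_EXPORTER_OTLP_ENDPOINT=http://localhost:4318

# Switch to a hosted OTLP backend - no code changes, just endpoint + auth
export OTEL_EXPORTER_OTLP_ENDPOINT=https://otlp.example-vendor.com
export OTEL_EXPORTER_OTLP_HEADERS="Authorization=Bearer <token>"
```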
Build vs Buy Comparison
| Feature | Datadog | Grafana Cloud | New Relic | Self-Hosted |
|---|---|---|---|---|
| Infra Monitoring | $15/host/mo | Free (10K series) | Free (100GB/mo) | $0 (Prometheus) |
| Log Management | $1.70/GB ingested | Free (50GB/mo) | Free (100GB/mo) | $0 (Loki) |
| APM / Traces | $31/host/mo | Free (50GB/mo) | Free (100GB/mo) | $0 (Tempo) |
| Alerting | Included | Included | Included | Alertmanager |
| Dashboards | Excellent | Excellent | Good | Grafana |
| Setup Effort | Low (agent) | Low (agent) | Low (agent) | High (docker-compose) |
| Maintenance | Zero | Zero | Zero | 2-4 hrs/month |
| 10-host estimate | $500-800/mo | $0-50/mo | $0 (within free) | $20-50/mo (VM) |
Budget Tiers
| Budget | Stack | What You Get |
|---|---|---|
| $0/mo | Grafana Cloud Free + OTel | 50GB logs, 50GB traces, 10K metrics series, 14-day retention |
| $100/mo | Self-hosted on a $50 VM + Grafana Cloud Free | Unlimited retention, full control, Grafana Cloud as backup |
| $500/mo | Grafana Cloud Pro | Longer retention, more volume, SLAs, SSO |
| $2K/mo | Datadog or Grafana Cloud Advanced | Full APM, RUM, synthetics, error tracking, ML-based alerts |
What to Monitor for SaaS
The Four Golden Signals
- Latency - p50, p95, p99 response time per endpoint
- Traffic - Requests per second, by endpoint and status code
- Errors - 5xx rate, error rate by endpoint, error types
- Saturation - CPU, memory, disk, connection pool usage
SaaS Business Metrics
- Signup rate (per hour/day)
- Login success/failure rate
- API usage by tenant (for rate limiting and billing)
- Stripe webhook processing time and failure rate
- Background job queue depth and processing time
Alerting Strategy
| Alert | Condition | Severity |
|---|---|---|
| High error rate | 5xx rate > 5% for 5 minutes | 🔴 Page |
| High latency | p95 > 2s for 10 minutes | 🟡 Warn |
| Service down | Zero successful requests for 2 minutes | 🔴 Page |
| Disk > 85% | Any volume > 85% full | 🟡 Warn |
| Certificate expiry | SSL cert expires in < 14 days | 🟡 Warn |
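The first two rows translate into Prometheus alerting rules along these lines (a sketch assuming the OTel HTTP metric names used in the PromQL section of this post; thresholds match the table):

```yaml
# alerts.yml - load via rule_files in prometheus.yml
groups:
  - name: golden-signals
    rules:
      - alert: HighErrorRate
        expr: |
          sum(rate(http_server_request_duration_seconds_count{http_status_code=~"5.."}[5m]))
          /
          sum(rate(http_server_request_duration_seconds_count[5m])) > 0.05
        for: 5m
        labels:
          severity: page
        annotations:
          summary: "5xx rate above 5% for 5 minutes"
      - alert: HighLatencyP95
        expr: |
          histogram_quantile(0.95,
            sum(rate(http_server_request_duration_seconds_bucket[5m])) by (le)
          ) > 2
        for: 10m
        labels:
          severity: warn
        annotations:
          summary: "p95 latency above 2s for 10 minutes"
```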
OTel Instrumentation (Express.js)
// tracing.js - Load BEFORE your app code
const { NodeSDK } = require("@opentelemetry/sdk-node");
const { getNodeAutoInstrumentations } = require("@opentelemetry/auto-instrumentations-node");
const { OTLPTraceExporter } = require("@opentelemetry/exporter-trace-otlp-http");
const { OTLPMetricExporter } = require("@opentelemetry/exporter-metrics-otlp-http");
const { PeriodicExportingMetricReader } = require("@opentelemetry/sdk-metrics");
const { Resource } = require("@opentelemetry/resources");
const { ATTR_SERVICE_NAME, ATTR_SERVICE_VERSION } = require("@opentelemetry/semantic-conventions");

// OTEL_EXPORTER_OTLP_ENDPOINT is the base URL; each signal gets its own path
const otlpBase = process.env.OTEL_EXPORTER_OTLP_ENDPOINT || "http://localhost:4318";

const sdk = new NodeSDK({
  resource: new Resource({
    [ATTR_SERVICE_NAME]: "my-saas-api",
    [ATTR_SERVICE_VERSION]: process.env.APP_VERSION || "0.1.0",
  }),
  traceExporter: new OTLPTraceExporter({ url: `${otlpBase}/v1/traces` }),
  metricReader: new PeriodicExportingMetricReader({
    exporter: new OTLPMetricExporter({ url: `${otlpBase}/v1/metrics` }),
    exportIntervalMillis: 15000,
  }),
  instrumentations: [
    getNodeAutoInstrumentations({
      // fs instrumentation is noisy and rarely useful; disable it
      "@opentelemetry/instrumentation-fs": { enabled: false },
    }),
  ],
});

sdk.start();

// Flush pending spans and metrics on shutdown
process.on("SIGTERM", () => sdk.shutdown());
Run your app with: node --require ./tracing.js app.js. This auto-instruments HTTP, Express, PostgreSQL, Redis, and gRPC with zero code changes to your app.
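The snippet assumes these packages are installed (all official OpenTelemetry JS packages):

```shell
npm install @opentelemetry/sdk-node \
  @opentelemetry/auto-instrumentations-node \
  @opentelemetry/exporter-trace-otlp-http \
  @opentelemetry/exporter-metrics-otlp-http \
  @opentelemetry/sdk-metrics \
  @opentelemetry/resources \
  @opentelemetry/semantic-conventions
```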
Self-Hosted Stack (Docker Compose)
# docker-compose.yml - Grafana + Prometheus + Loki + Tempo + OTel Collector
services:
  prometheus:
    image: prom/prometheus:v2.53.0
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
      - prometheus_data:/prometheus
    ports: ["9090:9090"]
    command:
      - --config.file=/etc/prometheus/prometheus.yml
      - --storage.tsdb.retention.time=30d
  loki:
    image: grafana/loki:3.1.0
    volumes:
      - loki_data:/loki
    ports: ["3100:3100"]
    command: ["-config.file=/etc/loki/local-config.yaml"]
  tempo:
    image: grafana/tempo:2.5.0
    volumes:
      - ./tempo.yml:/etc/tempo.yml
      - tempo_data:/tmp/tempo
    # Only the query port is published; OTLP reaches Tempo from the
    # collector over the internal network (publishing 4317/4318 here
    # would clash with the collector's host ports)
    ports: ["3200:3200"]
    command: ["-config.file=/etc/tempo.yml"]
  otel-collector:
    image: otel/opentelemetry-collector-contrib:0.102.0
    volumes:
      # The contrib image loads its config from /etc/otelcol-contrib/config.yaml
      - ./otel-collector.yml:/etc/otelcol-contrib/config.yaml
    ports: ["4317:4317", "4318:4318", "8889:8889"]
    depends_on: [prometheus, loki, tempo]
  grafana:
    image: grafana/grafana:11.1.0
    volumes:
      - grafana_data:/var/lib/grafana
    ports: ["3000:3000"]
    environment:
      GF_SECURITY_ADMIN_PASSWORD: changeme
      GF_AUTH_ANONYMOUS_ENABLED: "false"
    depends_on: [prometheus, loki, tempo]

volumes:
  prometheus_data:
  loki_data:
  tempo_data:
  grafana_data:
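The compose file references an otel-collector.yml that needs to exist alongside it. A minimal sketch, assuming Loki 3.x's native OTLP ingestion endpoint and Tempo accepting OTLP gRPC on the internal network:

```yaml
# otel-collector.yml - receive OTLP from apps, fan out to the three backends
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

processors:
  batch: {}

exporters:
  # Metrics: exposed for Prometheus to scrape
  # (add a scrape job for otel-collector:8889 in prometheus.yml)
  prometheus:
    endpoint: 0.0.0.0:8889
  # Logs: Loki 3.x accepts OTLP natively under /otlp
  otlphttp/loki:
    endpoint: http://loki:3100/otlp
  # Traces: forward to Tempo over OTLP gRPC
  otlp/tempo:
    endpoint: tempo:4317
    tls:
      insecure: true

service:
  pipelines:
    metrics:
      receivers: [otlp]
      processors: [batch]
      exporters: [prometheus]
    logs:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlphttp/loki]
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlp/tempo]
```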
PromQL Dashboard Queries
Four Golden Signals
# Request rate (traffic)
sum(rate(http_server_request_duration_seconds_count[5m])) by (http_route)
# Error rate
sum(rate(http_server_request_duration_seconds_count{http_status_code=~"5.."}[5m]))
/
sum(rate(http_server_request_duration_seconds_count[5m]))
# Latency p95
histogram_quantile(0.95,
sum(rate(http_server_request_duration_seconds_bucket[5m])) by (le, http_route)
)
# Saturation - CPU usage
1 - avg(rate(node_cpu_seconds_total{mode="idle"}[5m])) by (instance)
SLO Tracking (99.9% availability)
# Error budget remaining (30-day window)
1 - (
sum(increase(http_server_request_duration_seconds_count{http_status_code=~"5.."}[30d]))
/
sum(increase(http_server_request_duration_seconds_count[30d]))
) / 0.001
# Burn rate (how fast you're consuming error budget)
(
sum(rate(http_server_request_duration_seconds_count{http_status_code=~"5.."}[1h]))
/
sum(rate(http_server_request_duration_seconds_count[1h]))
) / 0.001
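To page on burn rate rather than eyeball it, a common pattern from the Google SRE workbook combines a long and a short window so the alert fires only while the burn is ongoing (a sketch, thresholds untuned):

```yaml
# Fast-burn alert: 14.4x burn over 1h consumes ~2% of a 30-day budget;
# the 5m clause confirms the burn is still happening before paging
- alert: ErrorBudgetFastBurn
  expr: |
    (
      sum(rate(http_server_request_duration_seconds_count{http_status_code=~"5.."}[1h]))
      /
      sum(rate(http_server_request_duration_seconds_count[1h]))
    ) / 0.001 > 14.4
    and
    (
      sum(rate(http_server_request_duration_seconds_count{http_status_code=~"5.."}[5m]))
      /
      sum(rate(http_server_request_duration_seconds_count[5m]))
    ) / 0.001 > 14.4
  labels:
    severity: page
```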
The Bottom Line
Start with Grafana Cloud Free + OpenTelemetry auto-instrumentation. You'll have metrics, logs, and traces in an afternoon with zero ongoing cost. Build 5 alerts (not 50), track the four golden signals, and add business metrics as you grow. Self-host only when you hit free tier limits or need longer retention. The goal is answering "what's broken and why" in under 5 minutes - not building a monitoring empire.