Observability Stack for Startups: OTel, Grafana & Beyond
The three pillars, the tools, the budget tiers, and the exact docker-compose file and PromQL queries to get production observability running this week.
A well-instrumented system tells you what's wrong before your users do
The Three Pillars
Metrics tell you something is wrong. Logs tell you what went wrong. Traces tell you where it went wrong. You need all three, but you don't need all three on day one.
| Pillar | What It Answers | Tool | Priority |
|---|---|---|---|
| Metrics | Is the system healthy? How fast? How loaded? | Prometheus / Grafana Mimir | 🔴 Day 1 |
| Logs | What happened? What was the error message? | Loki / Elasticsearch | 🟡 Week 1 |
| Traces | Which service/function is slow? What's the call chain? | Tempo / Jaeger | 🟢 Month 1 |
OpenTelemetry: The Current State
OpenTelemetry is the CNCF standard for instrumentation. In 2026, it's the default choice for new projects:
- Tracing: Stable (GA) in all major languages since 2023
- Metrics: Stable (GA) in all major languages since 2024
- Logs: Stable (GA) in most SDKs as of late 2025
- Profiling: Experimental - not production-ready yet
- Collector: Production-hardened, used at massive scale
The key advantage: vendor neutrality. Instrument once with OTel, export to any backend (Grafana, Datadog, New Relic, Honeycomb). Switch vendors without re-instrumenting your code.
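In practice, switching backends is usually just an environment change, since the OTLP exporters read their endpoint and headers from standard spec-defined variables (the vendor URL and token below are placeholders):

```shell
# Local collector (the default in the setup below)
export OTEL_EXPORTER_OTLP_ENDPOINT=http://localhost:4318

# Switch to a hosted OTLP backend - no code changes, just endpoint + auth
export OTEL_EXPORTER_OTLP_ENDPOINT=https://otlp.example-vendor.com
export OTEL_EXPORTER_OTLP_HEADERS="Authorization=Bearer <token>"
```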
Build vs Buy Comparison
| Feature | Datadog | Grafana Cloud | New Relic | Self-Hosted |
|---|---|---|---|---|
| Infra Monitoring | $15/host/mo | Free (10K series) | Free (100GB/mo) | $0 (Prometheus) |
| Log Management | $1.70/GB ingested | Free (50GB/mo) | Free (100GB/mo) | $0 (Loki) |
| APM / Traces | $31/host/mo | Free (50GB/mo) | Free (100GB/mo) | $0 (Tempo) |
| Alerting | Included | Included | Included | Alertmanager |
| Dashboards | Excellent | Excellent | Good | Grafana |
| Setup Effort | Low (agent) | Low (agent) | Low (agent) | High (docker-compose) |
| Maintenance | Zero | Zero | Zero | 2-4 hrs/month |
| 10-host estimate | $500-800/mo | $0-50/mo | $0 (within free) | $20-50/mo (VM) |
Budget Tiers
| Budget | Stack | What You Get |
|---|---|---|
| $0/mo | Grafana Cloud Free + OTel | 50GB logs, 50GB traces, 10K metrics series, 14-day retention |
| $100/mo | Self-hosted on a $50 VM + Grafana Cloud Free | Unlimited retention, full control, Grafana Cloud as backup |
| $500/mo | Grafana Cloud Pro | Longer retention, more volume, SLAs, SSO |
| $2K/mo | Datadog or Grafana Cloud Advanced | Full APM, RUM, synthetics, error tracking, ML-based alerts |
What to Monitor for SaaS
The Four Golden Signals
- Latency - p50, p95, p99 response time per endpoint
- Traffic - Requests per second, by endpoint and status code
- Errors - 5xx rate, error rate by endpoint, error types
- Saturation - CPU, memory, disk, connection pool usage
SaaS Business Metrics
- Signup rate (per hour/day)
- Login success/failure rate
- API usage by tenant (for rate limiting and billing)
- Stripe webhook processing time and failure rate
- Background job queue depth and processing time
Alerting Strategy
| Alert | Condition | Severity |
|---|---|---|
| High error rate | 5xx rate > 5% for 5 minutes | 🔴 Page |
| High latency | p95 > 2s for 10 minutes | 🟡 Warn |
| Service down | Zero successful requests for 2 minutes | 🔴 Page |
| Disk > 85% | Any volume > 85% full | 🟡 Warn |
| Certificate expiry | SSL cert expires in < 14 days | 🟡 Warn |
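The first two rows translate into Prometheus alerting rules along these lines (a sketch assuming the OTel HTTP metric names used in the PromQL section of this post; thresholds match the table):

```yaml
# alerts.yml - load via rule_files in prometheus.yml
groups:
  - name: golden-signals
    rules:
      - alert: HighErrorRate
        expr: |
          sum(rate(http_server_request_duration_seconds_count{http_status_code=~"5.."}[5m]))
          /
          sum(rate(http_server_request_duration_seconds_count[5m])) > 0.05
        for: 5m
        labels:
          severity: page
        annotations:
          summary: "5xx rate above 5% for 5 minutes"
      - alert: HighLatencyP95
        expr: |
          histogram_quantile(0.95,
            sum(rate(http_server_request_duration_seconds_bucket[5m])) by (le)
          ) > 2
        for: 10m
        labels:
          severity: warn
        annotations:
          summary: "p95 latency above 2s for 10 minutes"
```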
OTel Instrumentation (Express.js)
// tracing.js - Load BEFORE your app code
const { NodeSDK } = require("@opentelemetry/sdk-node");
const { getNodeAutoInstrumentations } = require("@opentelemetry/auto-instrumentations-node");
const { OTLPTraceExporter } = require("@opentelemetry/exporter-trace-otlp-http");
const { OTLPMetricExporter } = require("@opentelemetry/exporter-metrics-otlp-http");
const { PeriodicExportingMetricReader } = require("@opentelemetry/sdk-metrics");
const { Resource } = require("@opentelemetry/resources");
const { ATTR_SERVICE_NAME, ATTR_SERVICE_VERSION } = require("@opentelemetry/semantic-conventions");

// OTEL_EXPORTER_OTLP_ENDPOINT is the base URL; each signal gets its own path
const otlpBase = process.env.OTEL_EXPORTER_OTLP_ENDPOINT || "http://localhost:4318";

const sdk = new NodeSDK({
  resource: new Resource({
    [ATTR_SERVICE_NAME]: "my-saas-api",
    [ATTR_SERVICE_VERSION]: process.env.APP_VERSION || "0.1.0",
  }),
  traceExporter: new OTLPTraceExporter({ url: `${otlpBase}/v1/traces` }),
  metricReader: new PeriodicExportingMetricReader({
    exporter: new OTLPMetricExporter({ url: `${otlpBase}/v1/metrics` }),
    exportIntervalMillis: 15000,
  }),
  instrumentations: [
    getNodeAutoInstrumentations({
      // fs instrumentation is noisy and rarely useful; disable it
      "@opentelemetry/instrumentation-fs": { enabled: false },
    }),
  ],
});

sdk.start();

// Flush pending spans and metrics on shutdown
process.on("SIGTERM", () => sdk.shutdown());
Run your app with: node --require ./tracing.js app.js. This auto-instruments HTTP, Express, PostgreSQL, Redis, and gRPC with zero code changes to your app.
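The snippet assumes these packages are installed (all official OpenTelemetry JS packages):

```shell
npm install @opentelemetry/sdk-node \
  @opentelemetry/auto-instrumentations-node \
  @opentelemetry/exporter-trace-otlp-http \
  @opentelemetry/exporter-metrics-otlp-http \
  @opentelemetry/sdk-metrics \
  @opentelemetry/resources \
  @opentelemetry/semantic-conventions
```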
Self-Hosted Stack (Docker Compose)
# docker-compose.yml - Grafana + Prometheus + Loki + Tempo + OTel Collector
services:
  prometheus:
    image: prom/prometheus:v2.53.0
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
      - prometheus_data:/prometheus
    ports: ["9090:9090"]
    command:
      - --config.file=/etc/prometheus/prometheus.yml
      - --storage.tsdb.retention.time=30d
  loki:
    image: grafana/loki:3.1.0
    volumes:
      - loki_data:/loki
    ports: ["3100:3100"]
    command: ["-config.file=/etc/loki/local-config.yaml"]
  tempo:
    image: grafana/tempo:2.5.0
    volumes:
      - ./tempo.yml:/etc/tempo.yml
      - tempo_data:/tmp/tempo
    # Only the query port is published; OTLP reaches Tempo from the
    # collector over the internal network (publishing 4317/4318 here
    # would clash with the collector's host ports)
    ports: ["3200:3200"]
    command: ["-config.file=/etc/tempo.yml"]
  otel-collector:
    image: otel/opentelemetry-collector-contrib:0.102.0
    volumes:
      # The contrib image loads its config from /etc/otelcol-contrib/config.yaml
      - ./otel-collector.yml:/etc/otelcol-contrib/config.yaml
    ports: ["4317:4317", "4318:4318", "8889:8889"]
    depends_on: [prometheus, loki, tempo]
  grafana:
    image: grafana/grafana:11.1.0
    volumes:
      - grafana_data:/var/lib/grafana
    ports: ["3000:3000"]
    environment:
      GF_SECURITY_ADMIN_PASSWORD: changeme
      GF_AUTH_ANONYMOUS_ENABLED: "false"
    depends_on: [prometheus, loki, tempo]

volumes:
  prometheus_data:
  loki_data:
  tempo_data:
  grafana_data:
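The compose file references an otel-collector.yml that needs to exist alongside it. A minimal sketch, assuming Loki 3.x's native OTLP ingestion endpoint and Tempo accepting OTLP gRPC on the internal network:

```yaml
# otel-collector.yml - receive OTLP from apps, fan out to the three backends
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

processors:
  batch: {}

exporters:
  # Metrics: exposed for Prometheus to scrape
  # (add a scrape job for otel-collector:8889 in prometheus.yml)
  prometheus:
    endpoint: 0.0.0.0:8889
  # Logs: Loki 3.x accepts OTLP natively under /otlp
  otlphttp/loki:
    endpoint: http://loki:3100/otlp
  # Traces: forward to Tempo over OTLP gRPC
  otlp/tempo:
    endpoint: tempo:4317
    tls:
      insecure: true

service:
  pipelines:
    metrics:
      receivers: [otlp]
      processors: [batch]
      exporters: [prometheus]
    logs:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlphttp/loki]
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlp/tempo]
```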
PromQL Dashboard Queries
Four Golden Signals
# Request rate (traffic)
sum(rate(http_server_request_duration_seconds_count[5m])) by (http_route)
# Error rate
sum(rate(http_server_request_duration_seconds_count{http_status_code=~"5.."}[5m]))
/
sum(rate(http_server_request_duration_seconds_count[5m]))
# Latency p95
histogram_quantile(0.95,
sum(rate(http_server_request_duration_seconds_bucket[5m])) by (le, http_route)
)
# Saturation - CPU usage
1 - avg(rate(node_cpu_seconds_total{mode="idle"}[5m])) by (instance)
SLO Tracking (99.9% availability)
# Error budget remaining (30-day window)
1 - (
sum(increase(http_server_request_duration_seconds_count{http_status_code=~"5.."}[30d]))
/
sum(increase(http_server_request_duration_seconds_count[30d]))
) / 0.001
# Burn rate (how fast you're consuming error budget)
(
sum(rate(http_server_request_duration_seconds_count{http_status_code=~"5.."}[1h]))
/
sum(rate(http_server_request_duration_seconds_count[1h]))
) / 0.001
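To page on burn rate rather than eyeball it, a common pattern from the Google SRE workbook combines a long and a short window so the alert fires only while the burn is ongoing (a sketch, thresholds untuned):

```yaml
# Fast-burn alert: 14.4x burn over 1h consumes ~2% of a 30-day budget;
# the 5m clause confirms the burn is still happening before paging
- alert: ErrorBudgetFastBurn
  expr: |
    (
      sum(rate(http_server_request_duration_seconds_count{http_status_code=~"5.."}[1h]))
      /
      sum(rate(http_server_request_duration_seconds_count[1h]))
    ) / 0.001 > 14.4
    and
    (
      sum(rate(http_server_request_duration_seconds_count{http_status_code=~"5.."}[5m]))
      /
      sum(rate(http_server_request_duration_seconds_count[5m]))
    ) / 0.001 > 14.4
  labels:
    severity: page
```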
The Bottom Line
Start with Grafana Cloud Free + OpenTelemetry auto-instrumentation. You'll have metrics, logs, and traces in an afternoon with zero ongoing cost. Build 5 alerts (not 50), track the four golden signals, and add business metrics as you grow. Self-host only when you hit free tier limits or need longer retention. The goal is answering "what's broken and why" in under 5 minutes - not building a monitoring empire.