Why I Replaced My Kubernetes Cluster With Three AWS Lambdas

The Kubernetes cluster I killed last month served about 2,400 daily active users across three small web products. It ran on twelve t3.medium nodes, had four different YAML layers (Helm → Kustomize → Argo → actual manifests), and cost $1,140/month before the dev cluster I forgot was running brought the bill up to $1,580. It worked fine. It had worked fine for two years.

I replaced it with three AWS Lambda functions, an Aurora Serverless v2 instance, and an S3 bucket for static assets. The new setup cost $47 last month. It's also measurably more reliable than the thing it replaced.

This is not a Kubernetes takedown. K8s is a great tool for the problems it was designed to solve. What I want to walk through is the quiet realization that, for my specific case, it was never the right tool, and the operational costs I was paying for the privilege of running it anyway.

What It Actually Ran

Three products, none remotely exotic.

  • A marketing site with a contact form
  • A Rails 7 SaaS app with Postgres backend, modest traffic (~3 req/sec peak)
  • A Node.js job runner that processed a queue of about 40,000 tasks/day

The cluster had an nginx ingress, cert-manager, external-secrets, Prometheus + Grafana, Argo CD, and a Tekton pipeline for CI. All installed because "that's how you run Kubernetes." Not one of them earned its keep for the workload it was serving.

The Tax I Was Paying

I kept the cluster for two years because nothing was visibly broken. What I didn't count was the Tuesday-shaped hole in my calendar.

Certificate renewals. cert-manager is excellent until it quietly fails, at which point your site is down until someone notices. I had a monitoring alert for it. The alert fired three times in 2024.

Ingress controller drift. The nginx ingress chart got bumped once a quarter. Two of those bumps introduced breaking changes that took a Saturday to unwind. Not catastrophic, not rare.

Node-pool autoscaling edge cases. Cluster autoscaler would occasionally refuse to scale down because a DaemonSet pod was preventing drain. Bills ticked up. I'd notice on the 20th of the month.

Helm chart incidents. Two major incidents in 2025 came from Helm values being reinterpreted after a chart major-version bump. Each took about four hours to resolve. Both were "obvious in retrospect."

The mental overhead of maintaining all of it. When something broke, "is it the app, the cluster, or the ingress" was the first 20 minutes of every debug. That's the real tax.

💡 The honest question I didn't ask for two years: was any of this actually serving traffic better than a directly-deployed Lambda would have? No. It was serving the cluster's own complexity.

What the Migration Looked Like

Two weekends and about four weeknights. Not a rewrite; a repackaging.

BEFORE
  k8s cluster (12 nodes, $1.1K/mo)
    ├── nginx ingress
    ├── marketing-site deployment (2 pods)
    ├── rails-app deployment (4 pods)
    ├── node-worker deployment (3 pods)
    ├── postgres StatefulSet (1 pod, no backups without extra work)
    └── supporting cast: cert-manager, prometheus, argo, tekton, ...

AFTER
  aws
    ├── Lambda: marketing-site (Lambda Function URL + CloudFront)
    ├── Lambda: rails-app (Lambda Web Adapter, API Gateway HTTP API)
    ├── Lambda: node-worker (triggered by SQS)
    ├── Aurora Serverless v2 Postgres (auto-scales to 0.5 ACU at night)
    └── S3 + CloudFront for static assets

Rails on Lambda is real in 2026 thanks to the Lambda Web Adapter: you wrap your existing Rack app, configure a readiness check, and deploy it like any other Lambda. Cold starts are under a second on arm64. For traffic this size, nobody notices.
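
For a sense of the shape, here's a minimal CDK sketch of that deployment in TypeScript. The image path, memory size, and the AWS_LWA_* environment variable names are illustrative assumptions on my part, so check them against the Lambda Web Adapter documentation for the version you're using.

  import { Stack, StackProps, Duration, CfnOutput } from 'aws-cdk-lib';
  import { Construct } from 'constructs';
  import * as lambda from 'aws-cdk-lib/aws-lambda';
  import { HttpApi } from 'aws-cdk-lib/aws-apigatewayv2';
  import { HttpLambdaIntegration } from 'aws-cdk-lib/aws-apigatewayv2-integrations';

  export class RailsOnLambdaStack extends Stack {
    constructor(scope: Construct, id: string, props?: StackProps) {
      super(scope, id, props);

      // The Rails app is packaged as a container image; the Lambda Web Adapter
      // inside the image proxies HTTP requests to the Rack server.
      const railsFn = new lambda.DockerImageFunction(this, 'RailsApp', {
        code: lambda.DockerImageCode.fromImageAsset('./rails-app'), // illustrative path
        architecture: lambda.Architecture.ARM_64,
        memorySize: 1024,               // tune for your app
        timeout: Duration.seconds(30),
        environment: {
          // Adapter settings; variable names vary by adapter version, so verify.
          AWS_LWA_PORT: '8080',
          AWS_LWA_READINESS_CHECK_PATH: '/up',
        },
      });

      // A plain HTTP API in front of the function; every route goes to Rails.
      const api = new HttpApi(this, 'RailsHttpApi', {
        defaultIntegration: new HttpLambdaIntegration('RailsIntegration', railsFn),
      });

      new CfnOutput(this, 'RailsUrl', { value: api.apiEndpoint });
    }
  }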

The worker was the simplest change: the Rails queue classes now write SQS messages, and an SQS event source mapping triggers the worker Lambda. I lost about 30 lines of manual scheduling code in the process.
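
The handler itself is small. Here's a sketch of its shape in TypeScript, using the event types from @types/aws-lambda; processTask stands in for the real job logic, and the partial-batch return only works if ReportBatchItemFailures is enabled on the event source mapping.

  import type { SQSHandler, SQSBatchResponse } from 'aws-lambda';

  // Stand-in for the real job logic the old worker pods ran.
  async function processTask(body: string): Promise<void> {
    const task = JSON.parse(body);
    console.log('processing task', task.id);
  }

  // SQS hands the Lambda a batch of messages per invocation. Reporting
  // item-level failures lets SQS retry only the messages that failed.
  export const handler: SQSHandler = async (event): Promise<SQSBatchResponse> => {
    const batchItemFailures: { itemIdentifier: string }[] = [];

    for (const record of event.Records) {
      try {
        await processTask(record.body);
      } catch (err) {
        console.error('task failed', record.messageId, err);
        batchItemFailures.push({ itemIdentifier: record.messageId });
      }
    }

    return { batchItemFailures };
  };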

The marketing site didn't even need Lambda; it could've been pure S3 + CloudFront. It's on Lambda because a small form handler had to go somewhere, and bundling it with the static serving was less to maintain.
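
That form handler is about as small as a Lambda gets. A sketch of it in TypeScript; the field names and the /thanks.html redirect are made up for illustration, and the actual delivery step (SES, an SQS queue, a webhook) is left as a comment.

  import type { APIGatewayProxyEventV2, APIGatewayProxyResultV2 } from 'aws-lambda';

  // Function URLs use the same payload format (2.0) as API Gateway's HTTP API,
  // so the standard event types apply.
  export const handler = async (
    event: APIGatewayProxyEventV2
  ): Promise<APIGatewayProxyResultV2> => {
    if (event.requestContext.http.method !== 'POST') {
      return { statusCode: 405, body: 'Method Not Allowed' };
    }

    const raw = event.isBase64Encoded
      ? Buffer.from(event.body ?? '', 'base64').toString('utf8')
      : event.body ?? '';

    // Assumes a plain HTML form POST (application/x-www-form-urlencoded).
    const params = new URLSearchParams(raw);
    const email = params.get('email');
    const message = params.get('message');

    if (!email || !message) {
      return { statusCode: 400, body: 'Missing fields' };
    }

    // Deliver the message however you like: SES, an SQS queue, a webhook, ...
    console.log('contact form submission from', email);

    return { statusCode: 303, headers: { Location: '/thanks.html' }, body: '' };
  };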

The Actual Cost

| Line item | Before (monthly) | After (monthly) |
|---|---|---|
| Compute (nodes / Lambda) | $1,140 | $14 |
| Data transfer | $42 | $11 |
| Database (self-hosted Postgres / Aurora Serverless v2) | ~$0 (on cluster) | $16 |
| Object storage (S3) | $3 | $3 |
| CloudWatch / monitoring | $0 (Prometheus on cluster) | $3 |
| Leftover dev cluster I forgot | $440 | $0 |
| Total | $1,625 | $47 |

The $440 "dev cluster I forgot" line item is not a typo. That was a real mistake I was making, and it's a separate argument against running Kubernetes here: the clusters you forget about keep costing money. A Lambda you forget about costs pennies.

Reliability Got Better, Not Worse

I expected this to be the compromise. It wasn't.

In the first two months after the migration: zero outages. Certificate management is handled by AWS-managed ACM with CloudFront. No cert-manager to fail. No ingress controller to upgrade. When a Lambda invocation fails, SQS retries it; when it fails too many times, it goes to a DLQ I get paged on. That's less sophisticated than the old Prometheus/Alertmanager setup and — for this workload — strictly more useful.
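
That retry-then-page behaviour is a few lines of infrastructure. Here's a rough CDK sketch in TypeScript of how I'd express it; the maxReceiveCount, timeouts, and construct names are illustrative, and the alarm action (an SNS topic wired to the pager) is omitted.

  import { Duration } from 'aws-cdk-lib';
  import { Construct } from 'constructs';
  import * as sqs from 'aws-cdk-lib/aws-sqs';
  import * as cloudwatch from 'aws-cdk-lib/aws-cloudwatch';

  // A work queue with a dead-letter queue, plus an alarm on the DLQ so that
  // anything which exhausts its retries gets noticed instead of lost.
  export function addWorkerQueue(scope: Construct): sqs.Queue {
    const dlq = new sqs.Queue(scope, 'WorkerDLQ', {
      retentionPeriod: Duration.days(14),
    });

    const queue = new sqs.Queue(scope, 'WorkerQueue', {
      visibilityTimeout: Duration.minutes(5), // should exceed the Lambda timeout
      deadLetterQueue: { queue: dlq, maxReceiveCount: 3 }, // three tries, then DLQ
    });

    // Alarm as soon as a single message lands in the DLQ.
    dlq
      .metricApproximateNumberOfMessagesVisible({ period: Duration.minutes(5) })
      .createAlarm(scope, 'WorkerDLQAlarm', {
        threshold: 1,
        evaluationPeriods: 1,
        comparisonOperator:
          cloudwatch.ComparisonOperator.GREATER_THAN_OR_EQUAL_TO_THRESHOLD,
        treatMissingData: cloudwatch.TreatMissingData.NOT_BREACHING,
      });

    return queue;
  }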

The thing nobody talks about when evaluating reliability: how much variance does the platform itself add to your incident rate? Kubernetes added a lot for me. Lambda adds almost none.

When I'd Put K8s Back

Two honest cases.

Long-running stateful services. If I were running a database, a message broker, an ML training job, or anything with meaningful in-memory state, Lambda's 15-minute ceiling and per-invocation model would be the wrong shape. Kubernetes — or EKS, or a raw EC2 Auto Scaling Group, or Fargate — makes sense there.

A team with real K8s expertise and scale. If your company is 150 engineers and 40 services and you have two SREs who live in kubectl, Kubernetes is a force multiplier. Most of the pain I described doesn't apply at that scale. The pain I described applies specifically to the middle bucket: small team, modest workload, inherited-or-aspirational cluster.

The Takeaway

Default to the simplest thing that solves your problem, and re-check that answer at least once a year. The thing I was running wasn't solving my problem; it was solving the problem of running itself. Everyone I talk to in small-to-midsize companies has a variant of this story sitting one layer deep. You don't have to kill your cluster this week. You do have to ask whether it's still earning its keep.