AWS ECS Fargate Production Guide 2026 - Architecture, Scaling, Security and Cost Optimization
Everything you need to run production containers on ECS Fargate - from task definitions and capacity providers to blue/green deploys, observability, security hardening, and saving 70%+ with Spot and Graviton.
ECS Fargate removes the need to manage EC2 instances - you define tasks and AWS handles the rest
ECS vs EKS - The Decision Framework
ECS and EKS both run containers on AWS, but they solve different problems. ECS is AWS's proprietary container orchestrator. EKS is managed Kubernetes. The right choice depends on your team's skills, portability requirements, and operational appetite.
ECS has zero control plane cost. EKS charges $0.10/hour ($73/month) per cluster just for the Kubernetes API server. For small teams running a handful of services, that difference adds up fast.
| Criteria | ECS | EKS |
|---|---|---|
| Control plane cost | Free | $0.10/hr ($73/mo) |
| Learning curve | Low - AWS-native concepts | High - full Kubernetes API |
| Ecosystem | AWS-only integrations | Helm, operators, CNCF tools |
| Portability | AWS lock-in | Multi-cloud possible |
| Service mesh | Service Connect (built-in) | Istio, Linkerd, App Mesh |
| Auto-scaling | Application Auto Scaling | Karpenter, HPA, VPA, KEDA |
| Blue/green deploys | CodeDeploy native | Argo Rollouts, Flagger |
| Debugging | ECS Exec (SSM) | kubectl exec |
| GPU support | Yes (EC2 launch type) | Yes (managed node groups) |
| Windows containers | Yes | Yes |
ECS shines when you want tight AWS integration without managing Kubernetes complexity. CloudFormation and Terraform both have first-class ECS support. Load balancer target groups, service discovery, secrets, and IAM roles all wire up natively without extra controllers or CRDs.
EKS wins when you need the Kubernetes ecosystem. Helm charts, custom operators, GitOps with ArgoCD, and the ability to run the same manifests on GKE or AKS. The tradeoff is operational complexity - you are responsible for node groups, add-ons, RBAC policies, and cluster upgrades.
When ECS is the clear winner
- Teams under 10 engineers with no Kubernetes experience
- Startups that need to ship fast without infrastructure overhead
- Workloads that are 100% AWS and will stay that way
- Simple microservice architectures (under 20 services)
- Cost-sensitive environments where $73/month per cluster matters
When EKS is the clear winner
- Teams with existing Kubernetes expertise and tooling
- Multi-cloud or hybrid-cloud requirements
- Complex service mesh needs beyond what Service Connect provides
- Heavy use of Helm charts and Kubernetes operators
- Organizations with 50+ microservices and dedicated platform teams
Fargate Pricing 2026
Fargate pricing is per-second with a one-minute minimum. You pay for the vCPU and memory your task definition requests, not what the container actually uses. Getting your resource requests right is the single biggest lever for cost control.
All prices below are for US East (N. Virginia) as of May 2026. Other regions vary by 10-25%.
| Resource | On-Demand | Spot | Savings Plan (1yr) |
|---|---|---|---|
| vCPU per hour | $0.04048 | $0.01214 | $0.01943 |
| GB memory per hour | $0.004445 | $0.001334 | $0.002134 |
| Ephemeral storage (above 20 GB) | $0.000111/GB/hr | $0.000111/GB/hr | N/A |
| 1 vCPU / 2 GB task (monthly) | $35.74 | $10.72 | $17.16 |
| 2 vCPU / 4 GB task (monthly) | $71.48 | $21.44 | $34.31 |
| 4 vCPU / 8 GB task (monthly) | $142.96 | $42.89 | $68.62 |
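The per-task rows follow directly from the hourly rates. A quick sketch (assuming a 730-hour month; the table appears to use slightly fewer billable hours, so its figures come out a touch lower):

```shell
# Estimate monthly on-demand cost for a Fargate task from the us-east-1 rates above
fargate_monthly() {   # usage: fargate_monthly <vcpu> <memory-gb>
  awk -v v="$1" -v g="$2" 'BEGIN { printf "%.2f\n", (0.04048 * v + 0.004445 * g) * 730 }'
}

fargate_monthly 1 2   # 36.04 - close to the table's $35.74
fargate_monthly 4 8   # 144.16
```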
Fargate vs EC2 launch type cost comparison
For steady-state workloads running 24/7, EC2 launch type with Reserved Instances can be 40-60% cheaper than Fargate on-demand. But Fargate eliminates patching, AMI updates, and capacity planning. The operational savings often outweigh the compute premium, especially for teams without dedicated infrastructure engineers.
The break-even point depends on utilization. If your EC2 instances run above 70% CPU utilization consistently, EC2 launch type wins on cost. Below 50% utilization, Fargate's per-task billing usually wins because you are not paying for idle capacity.
ARM64 (Graviton) pricing advantage
Fargate tasks running on ARM64 (Graviton3) processors cost 20% less than x86 equivalents with comparable or better performance. A 1 vCPU / 2 GB ARM64 task costs $28.59/month versus $35.74 for x86. For most web services and API workloads, switching to ARM64 requires only changing the runtimePlatform in your task definition and rebuilding your container image for linux/arm64.
ECS Architecture Deep Dive
ECS has four core building blocks: clusters, task definitions, tasks, and services. Understanding how they fit together is essential before deploying anything to production.
Task definitions
A task definition is a blueprint for your application. It specifies container images, CPU/memory limits, port mappings, environment variables, logging configuration, and IAM roles. Think of it as a docker-compose file that AWS understands.
{
"family": "api-service",
"networkMode": "awsvpc",
"requiresCompatibilities": ["FARGATE"],
"cpu": "512",
"memory": "1024",
"runtimePlatform": {
"cpuArchitecture": "ARM64",
"operatingSystemFamily": "LINUX"
},
"executionRoleArn": "arn:aws:iam::123456789012:role/ecsTaskExecutionRole",
"taskRoleArn": "arn:aws:iam::123456789012:role/api-service-task-role",
"containerDefinitions": [
{
"name": "api",
"image": "123456789012.dkr.ecr.us-east-1.amazonaws.com/api:v1.2.3",
"portMappings": [{"containerPort": 8080, "protocol": "tcp"}],
"logConfiguration": {
"logDriver": "awslogs",
"options": {
"awslogs-group": "/ecs/api-service",
"awslogs-region": "us-east-1",
"awslogs-stream-prefix": "api"
}
},
"secrets": [
{"name": "DB_PASSWORD", "valueFrom": "arn:aws:secretsmanager:us-east-1:123456789012:secret:db-password"}
],
"healthCheck": {
"command": ["CMD-SHELL", "curl -f http://localhost:8080/health || exit 1"],
"interval": 30,
"timeout": 5,
"retries": 3,
"startPeriod": 60
}
}
]
}
Key details that trip people up:
- networkMode must be awsvpc for Fargate. Each task gets its own ENI and private IP.
- cpu and memory are strings at the task level, integers at the container level.
- executionRoleArn is for ECS agent operations (pulling images, writing logs). taskRoleArn is for your application code (calling S3, DynamoDB, etc.).
- Fargate supports specific CPU/memory combinations. You cannot request arbitrary values.
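The valid pairs follow a simple pattern per CPU tier. A sketch of the published combinations up to 4 vCPU (8 and 16 vCPU sizes also exist, with larger memory ranges):

```shell
# Valid Fargate memory (MB) per CPU value - combinations outside these ranges are rejected
fargate_memory_range() {   # usage: fargate_memory_range <cpu-units>
  case "$1" in
    256)  echo "512, 1024, or 2048" ;;
    512)  echo "1024-4096 in 1024 steps" ;;
    1024) echo "2048-8192 in 1024 steps" ;;
    2048) echo "4096-16384 in 1024 steps" ;;
    4096) echo "8192-30720 in 1024 steps" ;;
    *)    echo "invalid Fargate cpu value" ;;
  esac
}

fargate_memory_range 512   # 1024-4096 in 1024 steps
```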
Services
An ECS service maintains a desired count of running tasks and integrates with load balancers, service discovery, and deployment controllers. It is the layer that makes your containers production-ready.
Services handle rolling updates, health check grace periods, circuit breakers, and capacity provider strategies. A well-configured service definition is the difference between a toy deployment and a production system.
Capacity providers
Capacity providers tell ECS where to place tasks. For Fargate, you have two built-in providers: FARGATE (on-demand) and FARGATE_SPOT. You can mix them with a strategy:
{
"capacityProviderStrategy": [
{"capacityProvider": "FARGATE", "weight": 1, "base": 2},
{"capacityProvider": "FARGATE_SPOT", "weight": 3, "base": 0}
]
}
This configuration keeps a minimum of 2 on-demand tasks (the base) and distributes additional tasks 75% to Spot and 25% to on-demand. The base tasks guarantee availability even during Spot capacity shortages.
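To sanity-check a strategy before deploying it, the steady-state split can be worked out by hand. A sketch (actual placement is decided per task launch, so transient counts can differ by one):

```shell
# Split a desired count under the strategy above: base=2 on FARGATE, weights 1:3
split_tasks() {   # usage: split_tasks <desired-count>
  desired=$1; base=2; w_od=1; w_spot=3
  extra=$((desired - base))                        # tasks beyond the on-demand base
  spot=$((extra * w_spot / (w_od + w_spot)))       # weighted share goes to Spot
  echo "on-demand=$((base + extra - spot)) spot=$spot"
}

split_tasks 10   # on-demand=4 spot=6
```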
Networking with awsvpc
Every Fargate task gets its own elastic network interface (ENI) in your VPC. This means tasks have real private IPs, can be placed in private subnets, and are subject to security group rules. The downside is ENI density limits per subnet - plan your CIDR blocks accordingly. A /24 subnet supports roughly 250 tasks.
For public-facing services, place tasks in private subnets behind an Application Load Balancer in public subnets. Tasks should never have public IPs in production.
Production Deployment Patterns
ECS supports three deployment controllers: rolling update (default), blue/green via CodeDeploy, and external controllers. Each has tradeoffs between complexity, rollback speed, and blast radius.
Rolling updates
The default deployment type. ECS drains old tasks and starts new ones in batches controlled by minimumHealthyPercent and maximumPercent. For a service with 4 tasks:
- minimumHealthyPercent: 50 - ECS can stop 2 tasks before starting replacements
- maximumPercent: 200 - ECS can run up to 8 tasks during deployment
Rolling updates are simple and work well for most services. The downside is slow rollbacks - if the new version is broken, you must deploy the old version forward rather than instantly switching back.
Blue/green with CodeDeploy
Blue/green deployments run the new version alongside the old version and shift traffic atomically. ECS integrates with CodeDeploy to manage this through ALB target group switching.
# appspec.yaml for ECS blue/green
version: 0.0
Resources:
- TargetService:
Type: AWS::ECS::Service
Properties:
TaskDefinition: "arn:aws:ecs:us-east-1:123456789012:task-definition/api:42"
LoadBalancerInfo:
ContainerName: "api"
ContainerPort: 8080
Hooks:
- BeforeAllowTraffic: "LambdaValidationFunction"
- AfterAllowTraffic: "LambdaSmokeTestFunction"
CodeDeploy supports three traffic shifting strategies:
| Strategy | Behavior | Best for |
|---|---|---|
| AllAtOnce | 100% traffic shift immediately | Dev/staging environments |
| Linear10PercentEvery1Minute | 10% shift every minute | Low-risk production deploys |
| Canary10Percent5Minutes | 10% for 5 min, then 100% | High-risk changes |
The killer feature is instant rollback. If your validation Lambda or CloudWatch alarms detect problems, CodeDeploy shifts traffic back to the blue target group in seconds. No redeployment needed.
Circuit breakers
ECS deployment circuit breakers automatically roll back failed deployments without manual intervention. Enable them in your service definition:
{
"deploymentConfiguration": {
"deploymentCircuitBreaker": {
"enable": true,
"rollback": true
},
"minimumHealthyPercent": 100,
"maximumPercent": 200
}
}
The circuit breaker triggers when ECS cannot reach the desired task count after multiple attempts. It tracks the ratio of failed task launches to successful ones. If the failure threshold is breached, ECS automatically rolls back to the last stable deployment. This is a must-have for any production service.
Service discovery with Cloud Map
AWS Cloud Map provides DNS-based and API-based service discovery for ECS services. When a task starts, ECS registers it with Cloud Map. Other services resolve the name via private DNS.
# Tasks register as: api.production.local
# Other services connect via DNS
curl http://api.production.local:8080/health
For more advanced service-to-service communication, ECS Service Connect (generally available since late 2022) provides a built-in service mesh powered by Envoy. It handles load balancing, retries, timeouts, and observability between services without requiring you to manage Envoy configuration directly.
Service Connect is the recommended approach for new ECS deployments. It replaces the older App Mesh service and provides better integration with ECS task networking. See the Service Connect documentation for setup details.
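Enabling Service Connect is a per-service setting. A minimal sketch of the service-level configuration (names are illustrative; the portName must match a named entry in the task definition's portMappings):

```json
"serviceConnectConfiguration": {
  "enabled": true,
  "namespace": "production",
  "services": [
    {
      "portName": "api",
      "clientAliases": [{"port": 8080, "dnsName": "api"}]
    }
  ]
}
```

Other tasks in the same namespace can then reach this service at http://api:8080 through their local Envoy proxy.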
ECS Exec - Debugging Running Containers
ECS Exec gives you interactive shell access to running Fargate containers using AWS Systems Manager (SSM). No SSH, no bastion hosts, no sidecar containers. It works by injecting the SSM agent into your task at runtime.
Enabling ECS Exec
# Enable on an existing service
aws ecs update-service \
--cluster production \
--service api-service \
--enable-execute-command
# Start a new task with exec enabled
aws ecs run-task \
--cluster production \
--task-definition api-service:42 \
--enable-execute-command \
--network-configuration "awsvpcConfiguration={subnets=[subnet-abc123],securityGroups=[sg-abc123]}"
Connecting to a container
# Interactive shell
aws ecs execute-command \
--cluster production \
--task arn:aws:ecs:us-east-1:123456789012:task/production/abc123 \
--container api \
--interactive \
--command "/bin/sh"
# Run a single command
aws ecs execute-command \
--cluster production \
--task arn:aws:ecs:us-east-1:123456789012:task/production/abc123 \
--container api \
--interactive \
--command "cat /app/config.yaml"
Prerequisites
- The task role needs the ssmmessages:CreateControlChannel, ssmmessages:CreateDataChannel, ssmmessages:OpenControlChannel, and ssmmessages:OpenDataChannel permissions
- Install the Session Manager plugin on your local machine
- The task must be in RUNNING state
- Platform version 1.4.0 or later (current default is 1.4.0)
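The ssmmessages permissions translate to a small IAM policy attached to the task role (the role your application assumes); this matches the Terraform example later in this guide:

```json
{
  "Version": "2012-10-17",
  "Statement": [{
    "Effect": "Allow",
    "Action": [
      "ssmmessages:CreateControlChannel",
      "ssmmessages:CreateDataChannel",
      "ssmmessages:OpenControlChannel",
      "ssmmessages:OpenDataChannel"
    ],
    "Resource": "*"
  }]
}
```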
Troubleshooting ECS Exec
The most common failure is the SSM agent not starting inside the container. Use the amazon-ecs-exec-checker script to diagnose:
# Install the checker
git clone https://github.com/aws-containers/amazon-ecs-exec-checker.git
cd amazon-ecs-exec-checker
# Run diagnostics
./check-ecs-exec.sh production abc123def456
Common issues: VPC endpoints missing for ssmmessages (required if tasks are in private subnets without NAT), task role missing SSM permissions, or the container running as a non-root user without the required capabilities.
Auto-Scaling Strategies
ECS uses Application Auto Scaling to adjust the desired count of tasks in a service. You can scale on CPU, memory, custom CloudWatch metrics, or ALB request count. The right strategy depends on your workload pattern.
Target tracking (recommended default)
Target tracking is the simplest and most effective scaling policy for most services. You set a target value and ECS adjusts task count to maintain it.
# Scale to maintain 60% average CPU utilization
aws application-autoscaling register-scalable-target \
--service-namespace ecs \
--resource-id service/production/api-service \
--scalable-dimension ecs:service:DesiredCount \
--min-capacity 2 \
--max-capacity 20
aws application-autoscaling put-scaling-policy \
--service-namespace ecs \
--resource-id service/production/api-service \
--scalable-dimension ecs:service:DesiredCount \
--policy-name cpu-target-tracking \
--policy-type TargetTrackingScaling \
--target-tracking-scaling-policy-configuration '{
"TargetValue": 60.0,
"PredefinedMetricSpecification": {
"PredefinedMetricType": "ECSServiceAverageCPUUtilization"
},
"ScaleInCooldown": 300,
"ScaleOutCooldown": 60
}'
Key settings to tune:
- ScaleOutCooldown: 60 seconds - scale up fast when load increases
- ScaleInCooldown: 300 seconds - scale down slowly to avoid flapping
- Target value: 60-70% for CPU - leaves headroom for traffic spikes
ALB request count per target
For web services behind an ALB, scaling on request count per target is often more responsive than CPU-based scaling. It reacts to traffic increases before CPU utilization climbs.
{
"TargetValue": 1000.0,
"PredefinedMetricSpecification": {
"PredefinedMetricType": "ALBRequestCountPerTarget",
"ResourceLabel": "app/my-alb/abc123/targetgroup/my-tg/def456"
}
}
Step scaling for bursty workloads
Step scaling lets you define different scaling actions at different alarm thresholds. Useful for workloads with sudden traffic spikes where you need aggressive scale-out:
- CPU 60-75%: add 2 tasks
- CPU 75-90%: add 4 tasks
- CPU above 90%: add 8 tasks
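With Application Auto Scaling, those tiers map onto a step-scaling policy whose interval bounds are offsets from the CloudWatch alarm threshold. A sketch assuming an alarm that fires at 60% CPU (so an interval of 0-15 covers 60-75%):

```json
{
  "AdjustmentType": "ChangeInCapacity",
  "MetricAggregationType": "Average",
  "Cooldown": 60,
  "StepAdjustments": [
    {"MetricIntervalLowerBound": 0,  "MetricIntervalUpperBound": 15, "ScalingAdjustment": 2},
    {"MetricIntervalLowerBound": 15, "MetricIntervalUpperBound": 30, "ScalingAdjustment": 4},
    {"MetricIntervalLowerBound": 30, "ScalingAdjustment": 8}
  ]
}
```

Pass this as --step-scaling-policy-configuration to aws application-autoscaling put-scaling-policy with --policy-type StepScaling.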
Scheduled scaling
If your traffic follows predictable patterns (business hours, batch processing windows), scheduled scaling pre-provisions capacity before the load arrives:
# Scale up for business hours (8 AM EST)
aws application-autoscaling put-scheduled-action \
--service-namespace ecs \
--resource-id service/production/api-service \
--scalable-dimension ecs:service:DesiredCount \
--scheduled-action-name scale-up-morning \
--schedule "cron(0 13 ? * MON-FRI *)" \
--scalable-target-action MinCapacity=6,MaxCapacity=20
# Scale down for evenings (8 PM EST)
aws application-autoscaling put-scheduled-action \
--service-namespace ecs \
--resource-id service/production/api-service \
--scalable-dimension ecs:service:DesiredCount \
--scheduled-action-name scale-down-evening \
--schedule "cron(0 1 ? * TUE-SAT *)" \
--scalable-target-action MinCapacity=2,MaxCapacity=10
Observability - Container Insights, X-Ray, and FireLens
Running containers without observability is flying blind. ECS integrates with three AWS services that together provide metrics, traces, and structured logs for your production workloads. For a deeper dive into observability tooling, see our Observability Stack guide.
Container Insights
Container Insights collects CPU, memory, network, and storage metrics at the task and service level. The enhanced observability mode (launched 2024) adds container-level metrics and automatic dashboard generation.
# Enable Container Insights on a cluster
aws ecs update-cluster-settings \
--cluster production \
--settings name=containerInsights,value=enhanced
Enhanced Container Insights provides:
- Per-container CPU and memory utilization (not just per-task)
- Network bytes in/out per container
- Storage read/write operations
- Automatic CloudWatch dashboards with service maps
- Performance anomaly detection with ML-based alerts
The cost is $0.01 per Container Insights metric per month. For a cluster with 20 services, expect roughly $15-25/month in Container Insights charges. Worth every penny compared to debugging blind.
AWS X-Ray for distributed tracing
X-Ray traces requests across your microservices, showing latency breakdowns, error rates, and dependency maps. For ECS Fargate, deploy the X-Ray daemon as a sidecar container:
{
"name": "xray-daemon",
"image": "public.ecr.aws/xray/aws-xray-daemon:latest",
"cpu": 32,
"memoryReservation": 64,
"portMappings": [{"containerPort": 2000, "protocol": "udp"}],
"logConfiguration": {
"logDriver": "awslogs",
"options": {
"awslogs-group": "/ecs/xray-daemon",
"awslogs-region": "us-east-1",
"awslogs-stream-prefix": "xray"
}
}
}
Your application sends trace segments to localhost:2000 via UDP. The X-Ray SDK is available for Java, Python, Node.js, Go, .NET, and Ruby. For languages without an SDK, use the OpenTelemetry collector with the X-Ray exporter instead.
FireLens for log routing
FireLens is an ECS-native log router built on Fluent Bit. Instead of sending all logs to CloudWatch Logs (expensive at scale), FireLens lets you route logs to S3, Elasticsearch, Datadog, Splunk, or any Fluent Bit output plugin.
{
"name": "log-router",
"image": "public.ecr.aws/aws-observability/aws-for-fluent-bit:stable",
"firelensConfiguration": {
"type": "fluentbit",
"options": {
"config-file-type": "file",
"config-file-value": "/fluent-bit/configs/output.conf"
}
},
"essential": true,
"cpu": 64,
"memoryReservation": 128
}
A common production pattern routes logs to three destinations simultaneously:
- CloudWatch Logs - for real-time debugging and CloudWatch Logs Insights queries
- S3 - for long-term retention and compliance (much cheaper than CloudWatch)
- OpenSearch - for full-text search and dashboards
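A sketch of a Fluent Bit output.conf implementing that three-way fan-out (the bucket, domain, and group names are illustrative):

```ini
[OUTPUT]
    Name              cloudwatch_logs
    Match             *
    region            us-east-1
    log_group_name    /ecs/api-service
    log_stream_prefix api-

[OUTPUT]
    Name            s3
    Match           *
    region          us-east-1
    bucket          my-log-archive
    total_file_size 50M
    upload_timeout  10m

[OUTPUT]
    Name       opensearch
    Match      *
    Host       my-domain.us-east-1.es.amazonaws.com
    Port       443
    Index      app-logs
    tls        On
    AWS_Auth   On
    AWS_Region us-east-1
```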
Security Best Practices
Container security on ECS spans IAM roles, secrets management, network isolation, and image scanning. Getting these right from day one prevents the kind of security incidents that wake you up at 3 AM. For a broader view, see our Cloud Security guide.
Task roles vs execution roles
This is the most commonly confused aspect of ECS security. There are two distinct IAM roles:
- Execution role - used by the ECS agent to pull container images from ECR, write logs to CloudWatch, and retrieve secrets. This role is never available to your application code.
- Task role - assumed by your application code at runtime. If your API needs to read from S3 or write to DynamoDB, those permissions go on the task role.
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Action": [
"s3:GetObject",
"s3:PutObject"
],
"Resource": "arn:aws:s3:::my-app-bucket/*"
},
{
"Effect": "Allow",
"Action": [
"dynamodb:GetItem",
"dynamodb:PutItem",
"dynamodb:Query"
],
"Resource": "arn:aws:dynamodb:us-east-1:123456789012:table/my-app-table"
}
]
}
Follow the principle of least privilege. Each service should have its own task role with only the permissions it needs. Never share task roles across services.
Secrets management
Never bake secrets into container images or pass them as plain-text environment variables. ECS integrates with both AWS Secrets Manager and SSM Parameter Store to inject secrets at task startup:
{
"secrets": [
{
"name": "DB_PASSWORD",
"valueFrom": "arn:aws:secretsmanager:us-east-1:123456789012:secret:prod/db-password"
},
{
"name": "API_KEY",
"valueFrom": "arn:aws:ssm:us-east-1:123456789012:parameter/prod/api-key"
}
]
}
Secrets Manager costs $0.40/secret/month plus $0.05 per 10,000 API calls. SSM Parameter Store SecureString parameters are free for standard throughput. Use Secrets Manager for credentials that need automatic rotation. Use SSM Parameter Store for everything else.
VPC and network security
- Private subnets only - Fargate tasks should never have public IPs. Use a NAT Gateway or VPC endpoints for outbound access.
- Security groups per service - each ECS service should have its own security group. Allow only the ports and sources that service needs.
- VPC endpoints - for private subnets, create endpoints for ECR (ecr.api, ecr.dkr), S3 (gateway), CloudWatch Logs, Secrets Manager, and SSM. This keeps traffic off the public internet and avoids NAT Gateway data processing charges.
- Network ACLs - use as a secondary defense layer. Keep them simple - deny known bad CIDR ranges and allow everything else at the NACL level. Let security groups handle fine-grained access control.
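A sketch of the endpoint resources in Terraform (the endpoint security group and route table variable are illustrative; repeat the interface pattern for ecr.dkr, logs, secretsmanager, and the ssmmessages endpoint that ECS Exec needs):

```hcl
resource "aws_vpc_endpoint" "ecr_api" {
  vpc_id              = var.vpc_id
  service_name        = "com.amazonaws.us-east-1.ecr.api"
  vpc_endpoint_type   = "Interface"
  subnet_ids          = var.private_subnet_ids
  security_group_ids  = [aws_security_group.endpoints.id]
  private_dns_enabled = true
}

# S3 uses a free gateway endpoint attached to the private route tables
resource "aws_vpc_endpoint" "s3" {
  vpc_id            = var.vpc_id
  service_name      = "com.amazonaws.us-east-1.s3"
  vpc_endpoint_type = "Gateway"
  route_table_ids   = var.private_route_table_ids
}
```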
Image security
- Enable ECR image scanning on push. It checks for CVEs in OS packages and language dependencies.
- Use minimal base images (distroless, Alpine, or scratch) to reduce attack surface.
- Pin image tags to immutable digests in production: image: 123456789012.dkr.ecr.us-east-1.amazonaws.com/api@sha256:abc123...
- Enable ECR lifecycle policies to automatically clean up untagged and old images.
- Run containers as a non-root user. Add USER 1000 to your Dockerfile.
CI/CD with GitHub Actions
A production ECS pipeline builds the container image, pushes it to ECR, updates the task definition, and deploys the new service version. GitHub Actions has official AWS actions that make this straightforward.
Complete workflow
# .github/workflows/deploy.yml
name: Deploy to ECS
on:
push:
branches: [main]
permissions:
id-token: write
contents: read
env:
AWS_REGION: us-east-1
ECR_REPOSITORY: api-service
ECS_CLUSTER: production
ECS_SERVICE: api-service
TASK_DEFINITION: .aws/task-definition.json
CONTAINER_NAME: api
jobs:
deploy:
runs-on: ubuntu-latest
steps:
- name: Checkout
uses: actions/checkout@v4
- name: Configure AWS credentials
uses: aws-actions/configure-aws-credentials@v4
with:
role-to-assume: arn:aws:iam::123456789012:role/github-actions-deploy
aws-region: ${{ env.AWS_REGION }}
- name: Login to ECR
id: ecr-login
uses: aws-actions/amazon-ecr-login@v2
- name: Build and push image
id: build
env:
ECR_REGISTRY: ${{ steps.ecr-login.outputs.registry }}
IMAGE_TAG: ${{ github.sha }}
run: |
docker build \
--platform linux/arm64 \
-t $ECR_REGISTRY/$ECR_REPOSITORY:$IMAGE_TAG \
-t $ECR_REGISTRY/$ECR_REPOSITORY:latest \
.
docker push $ECR_REGISTRY/$ECR_REPOSITORY:$IMAGE_TAG
docker push $ECR_REGISTRY/$ECR_REPOSITORY:latest
echo "image=$ECR_REGISTRY/$ECR_REPOSITORY:$IMAGE_TAG" >> $GITHUB_OUTPUT
- name: Update task definition
id: task-def
uses: aws-actions/amazon-ecs-render-task-definition@v1
with:
task-definition: ${{ env.TASK_DEFINITION }}
container-name: ${{ env.CONTAINER_NAME }}
image: ${{ steps.build.outputs.image }}
- name: Deploy to ECS
uses: aws-actions/amazon-ecs-deploy-task-definition@v2
with:
task-definition: ${{ steps.task-def.outputs.task-definition }}
service: ${{ env.ECS_SERVICE }}
cluster: ${{ env.ECS_CLUSTER }}
wait-for-service-stability: true
wait-for-minutes: 10
Key details
- OIDC authentication - use role-to-assume instead of storing AWS access keys as secrets. OIDC is more secure and eliminates key rotation.
- ARM64 builds - the --platform linux/arm64 flag builds for Graviton. If your GitHub runner is x86, enable Docker BuildKit with QEMU emulation or use a self-hosted ARM runner.
- wait-for-service-stability - the deploy step waits until the new tasks pass health checks and the old tasks drain. If the deployment fails, the step fails and you get a clear error in your PR.
- Image tagging - use the git SHA as the primary tag for traceability. Push :latest as a convenience tag but never reference it in task definitions.
Blue/green variant
For CodeDeploy blue/green deployments, replace the final step with:
- name: Deploy (blue/green)
uses: aws-actions/amazon-ecs-deploy-task-definition@v2
with:
task-definition: ${{ steps.task-def.outputs.task-definition }}
service: ${{ env.ECS_SERVICE }}
cluster: ${{ env.ECS_CLUSTER }}
codedeploy-appspec: .aws/appspec.yaml
codedeploy-application: api-service
codedeploy-deployment-group: api-service-prod
This triggers a CodeDeploy deployment instead of a rolling update, giving you canary analysis and instant rollback capabilities.
Cost Optimization
Container costs on AWS can spiral quickly if you are not deliberate about optimization. The three biggest levers are Fargate Spot, ARM64/Graviton, and Compute Savings Plans. Combined, they can reduce your bill by 70-80%. For a comprehensive approach, see our AWS Cost Optimization guide.
Fargate Spot - 70% savings
Fargate Spot uses spare AWS capacity at a 70% discount. When AWS needs the capacity back, tasks receive a two-minute interruption warning, then a SIGTERM followed by the task's stopTimeout grace period (30 seconds by default, up to 120 on Fargate) before SIGKILL. Use Spot for:
- Queue workers and background job processors
- Batch data processing pipelines
- Development and staging environments
- Stateless web services with multiple replicas (behind a load balancer)
Do not use Spot for singleton tasks, stateful workloads, or anything that cannot tolerate being interrupted and rescheduled. The capacity provider strategy shown earlier (base of 2 on-demand, rest on Spot) is the recommended pattern for production web services.
ARM64/Graviton - 20% savings
AWS Graviton3 processors deliver 20% lower cost and up to 40% better price-performance compared to x86 for most workloads. Switching requires two changes:
- Update your task definition's runtimePlatform.cpuArchitecture to ARM64
- Build your container image for linux/arm64
# Multi-arch Dockerfile
FROM --platform=$TARGETPLATFORM node:20-alpine
WORKDIR /app
COPY package*.json ./
RUN npm ci --production
COPY . .
USER 1000
EXPOSE 8080
CMD ["node", "server.js"]
# Build multi-arch image with Docker BuildKit
docker buildx build \
--platform linux/amd64,linux/arm64 \
-t 123456789012.dkr.ecr.us-east-1.amazonaws.com/api:v1.2.3 \
--push .
Most Node.js, Python, Go, and Java applications work on ARM64 without code changes. Watch out for native binary dependencies (some npm packages with C extensions, Python packages with compiled wheels). Test thoroughly before switching production.
Compute Savings Plans - up to 52% savings
Compute Savings Plans commit to a consistent amount of compute spend (measured in $/hour) for 1 or 3 years. They apply automatically to Fargate, Lambda, and EC2 usage across all regions.
| Plan type | Term | Payment | Discount |
|---|---|---|---|
| Compute Savings Plan | 1 year | All upfront | 37% |
| Compute Savings Plan | 1 year | No upfront | 29% |
| Compute Savings Plan | 3 year | All upfront | 52% |
| Compute Savings Plan | 3 year | No upfront | 43% |
Start by analyzing your last 30 days of Fargate usage in Cost Explorer. Look at the Savings Plans recommendations page - it calculates the optimal commitment level based on your actual usage patterns. Commit to 70-80% of your baseline usage and let the rest run on-demand for flexibility.
Right-sizing task definitions
Over-provisioned task definitions are the silent budget killer. Use Container Insights metrics to compare requested vs actual CPU and memory usage:
# CloudWatch Logs Insights query for right-sizing
# (run against the /aws/ecs/containerinsights/production/performance log group)
filter Type = "Task" and ServiceName = "api-service"
| stats avg(CpuUtilized) as avg_cpu,
        max(CpuUtilized) as max_cpu,
        avg(MemoryUtilized) as avg_mem_mb,
        max(MemoryUtilized) as max_mem_mb
  by bin(1h)
If your task requests 1024 CPU units but peaks at 400, drop it to 512. If memory peaks at 600 MB but you allocated 2048 MB, drop to 1024 MB. Each reduction directly lowers your per-second Fargate bill.
Combined savings example
| Optimization | Monthly cost (10 tasks, 1 vCPU / 2 GB) | Savings |
|---|---|---|
| Baseline (x86, on-demand) | $357.40 | - |
| + ARM64 | $285.92 | 20% |
| + Spot (70% of tasks) | $160.73 | 55% |
| + Savings Plan (on-demand portion) | $118.14 | 67% |
| + Right-sizing (0.5 vCPU / 1 GB) | $59.07 | 83% |
From $357/month to $59/month for the same workload. That is an 83% reduction by stacking four optimization strategies.
Complete Terraform Example
Here is a production-ready Terraform configuration that ties together everything covered in this guide: Fargate with Spot capacity providers, ALB, auto-scaling, Container Insights, security groups, and secrets. For more on infrastructure as code, see our IaC Guide.
# providers.tf
terraform {
required_version = ">= 1.8"
required_providers {
aws = {
source = "hashicorp/aws"
version = "~> 5.50"
}
}
backend "s3" {
bucket = "my-terraform-state"
key = "ecs/production/terraform.tfstate"
region = "us-east-1"
}
}
provider "aws" {
region = "us-east-1"
}
# variables.tf
variable "app_name" {
default = "api-service"
}
variable "environment" {
default = "production"
}
variable "container_image" {
description = "Full ECR image URI with tag"
type = string
}
variable "vpc_id" {
type = string
}
variable "private_subnet_ids" {
type = list(string)
}
variable "public_subnet_ids" {
type = list(string)
}
variable "certificate_arn" {
description = "ACM certificate ARN for the ALB HTTPS listener"
type = string
}
# ecs-cluster.tf
resource "aws_ecs_cluster" "main" {
name = var.environment
setting {
name = "containerInsights"
value = "enhanced"
}
}
resource "aws_ecs_cluster_capacity_providers" "main" {
cluster_name = aws_ecs_cluster.main.name
capacity_providers = ["FARGATE", "FARGATE_SPOT"]
default_capacity_provider_strategy {
capacity_provider = "FARGATE"
weight = 1
base = 2
}
default_capacity_provider_strategy {
capacity_provider = "FARGATE_SPOT"
weight = 3
}
}
# iam.tf
resource "aws_iam_role" "execution" {
name = "${var.app_name}-execution"
assume_role_policy = jsonencode({
Version = "2012-10-17"
Statement = [{
Action = "sts:AssumeRole"
Effect = "Allow"
Principal = { Service = "ecs-tasks.amazonaws.com" }
}]
})
}
resource "aws_iam_role_policy_attachment" "execution" {
role = aws_iam_role.execution.name
policy_arn = "arn:aws:iam::aws:policy/service-role/AmazonECSTaskExecutionRolePolicy"
}
resource "aws_iam_role_policy" "execution_secrets" {
name = "secrets-access"
role = aws_iam_role.execution.id
policy = jsonencode({
Version = "2012-10-17"
Statement = [{
Effect = "Allow"
Action = ["secretsmanager:GetSecretValue"]
Resource = ["arn:aws:secretsmanager:us-east-1:*:secret:${var.environment}/*"]
}]
})
}
resource "aws_iam_role" "task" {
name = "${var.app_name}-task"
assume_role_policy = jsonencode({
Version = "2012-10-17"
Statement = [{
Action = "sts:AssumeRole"
Effect = "Allow"
Principal = { Service = "ecs-tasks.amazonaws.com" }
}]
})
}
# Add SSM permissions for ECS Exec
resource "aws_iam_role_policy" "task_exec" {
name = "ecs-exec"
role = aws_iam_role.task.id
policy = jsonencode({
Version = "2012-10-17"
Statement = [{
Effect = "Allow"
Action = [
"ssmmessages:CreateControlChannel",
"ssmmessages:CreateDataChannel",
"ssmmessages:OpenControlChannel",
"ssmmessages:OpenDataChannel"
]
Resource = "*"
}]
})
}
# alb.tf
resource "aws_security_group" "alb" {
name_prefix = "${var.app_name}-alb-"
vpc_id = var.vpc_id
ingress {
from_port = 443
to_port = 443
protocol = "tcp"
cidr_blocks = ["0.0.0.0/0"]
}
egress {
from_port = 0
to_port = 0
protocol = "-1"
cidr_blocks = ["0.0.0.0/0"]
}
}
resource "aws_lb" "main" {
name = var.app_name
internal = false
load_balancer_type = "application"
security_groups = [aws_security_group.alb.id]
subnets = var.public_subnet_ids
}
resource "aws_lb_target_group" "main" {
name = var.app_name
port = 8080
protocol = "HTTP"
vpc_id = var.vpc_id
target_type = "ip"
health_check {
path = "/health"
healthy_threshold = 2
unhealthy_threshold = 3
timeout = 5
interval = 30
matcher = "200"
}
deregistration_delay = 30
}
resource "aws_lb_listener" "https" {
load_balancer_arn = aws_lb.main.arn
port = 443
protocol = "HTTPS"
ssl_policy = "ELBSecurityPolicy-TLS13-1-2-2021-06"
certificate_arn = var.certificate_arn
default_action {
type = "forward"
target_group_arn = aws_lb_target_group.main.arn
}
}
# task-definition.tf
resource "aws_cloudwatch_log_group" "app" {
name = "/ecs/${var.app_name}"
retention_in_days = 30
}
resource "aws_ecs_task_definition" "main" {
family = var.app_name
network_mode = "awsvpc"
requires_compatibilities = ["FARGATE"]
cpu = "512"
memory = "1024"
execution_role_arn = aws_iam_role.execution.arn
task_role_arn = aws_iam_role.task.arn
runtime_platform {
cpu_architecture = "ARM64"
operating_system_family = "LINUX"
}
container_definitions = jsonencode([{
name = "api"
image = var.container_image
portMappings = [{
containerPort = 8080
protocol = "tcp"
}]
logConfiguration = {
logDriver = "awslogs"
options = {
"awslogs-group" = aws_cloudwatch_log_group.app.name
"awslogs-region" = "us-east-1"
"awslogs-stream-prefix" = "api"
}
}
secrets = [
{
name = "DB_PASSWORD"
valueFrom = "arn:aws:secretsmanager:us-east-1:123456789012:secret:${var.environment}/db-password"
}
]
healthCheck = {
command = ["CMD-SHELL", "curl -f http://localhost:8080/health || exit 1"]
interval = 30
timeout = 5
retries = 3
startPeriod = 60
}
}])
}
# service.tf
resource "aws_security_group" "ecs" {
name_prefix = "${var.app_name}-ecs-"
vpc_id = var.vpc_id
ingress {
from_port = 8080
to_port = 8080
protocol = "tcp"
security_groups = [aws_security_group.alb.id]
}
egress {
from_port = 0
to_port = 0
protocol = "-1"
cidr_blocks = ["0.0.0.0/0"]
}
}
resource "aws_ecs_service" "main" {
name = var.app_name
cluster = aws_ecs_cluster.main.id
task_definition = aws_ecs_task_definition.main.arn
desired_count = 3
enable_execute_command = true
capacity_provider_strategy {
capacity_provider = "FARGATE"
weight = 1
base = 2
}
capacity_provider_strategy {
capacity_provider = "FARGATE_SPOT"
weight = 3
}
network_configuration {
subnets = var.private_subnet_ids
security_groups = [aws_security_group.ecs.id]
}
load_balancer {
target_group_arn = aws_lb_target_group.main.arn
container_name = "api"
container_port = 8080
}
deployment_minimum_healthy_percent = 100
deployment_maximum_percent = 200
deployment_circuit_breaker {
enable = true
rollback = true
}
lifecycle {
ignore_changes = [desired_count]
}
}
# autoscaling.tf
resource "aws_appautoscaling_target" "ecs" {
max_capacity = 20
min_capacity = 2
resource_id = "service/${aws_ecs_cluster.main.name}/${aws_ecs_service.main.name}"
scalable_dimension = "ecs:service:DesiredCount"
service_namespace = "ecs"
}
resource "aws_appautoscaling_policy" "cpu" {
name = "${var.app_name}-cpu"
policy_type = "TargetTrackingScaling"
resource_id = aws_appautoscaling_target.ecs.resource_id
scalable_dimension = aws_appautoscaling_target.ecs.scalable_dimension
service_namespace = aws_appautoscaling_target.ecs.service_namespace
target_tracking_scaling_policy_configuration {
predefined_metric_specification {
predefined_metric_type = "ECSServiceAverageCPUUtilization"
}
target_value = 60
scale_in_cooldown = 300
scale_out_cooldown = 60
}
}