[Figure: AWS ECS Fargate architecture diagram showing tasks, services, and capacity providers]

ECS Fargate removes the need to manage EC2 instances - you define tasks and AWS handles the rest.

Last updated: May 2026 - Covers ECS Fargate 2026 pricing, capacity providers, ECS Exec, Service Connect, blue/green with CodeDeploy, Container Insights enhanced observability, and complete Terraform examples.

ECS vs EKS - The Decision Framework

ECS and EKS both run containers on AWS, but they solve different problems. ECS is AWS's proprietary container orchestrator. EKS is managed Kubernetes. The right choice depends on your team's skills, portability requirements, and operational appetite.

ECS has zero control plane cost. EKS charges $0.10/hour ($73/month) per cluster just for the Kubernetes API server. For small teams running a handful of services, that difference adds up fast.

| Criteria | ECS | EKS |
| --- | --- | --- |
| Control plane cost | Free | $0.10/hr ($73/mo) |
| Learning curve | Low - AWS-native concepts | High - full Kubernetes API |
| Ecosystem | AWS-only integrations | Helm, operators, CNCF tools |
| Portability | AWS lock-in | Multi-cloud possible |
| Service mesh | Service Connect (built-in) | Istio, Linkerd, App Mesh |
| Auto-scaling | Application Auto Scaling | Karpenter, HPA, VPA, KEDA |
| Blue/green deploys | CodeDeploy native | Argo Rollouts, Flagger |
| Debugging | ECS Exec (SSM) | kubectl exec |
| GPU support | Yes (EC2 launch type) | Yes (managed node groups) |
| Windows containers | Yes | Yes |

Decision shortcut: If your team does not already know Kubernetes, start with ECS. You can always migrate to EKS later. If you need multi-cloud portability or your team lives in kubectl, go with EKS. There is no wrong answer - both run the same containers.

ECS shines when you want tight AWS integration without managing Kubernetes complexity. CloudFormation and Terraform both have first-class ECS support. Load balancer target groups, service discovery, secrets, and IAM roles all wire up natively without extra controllers or CRDs.

EKS wins when you need the Kubernetes ecosystem. Helm charts, custom operators, GitOps with ArgoCD, and the ability to run the same manifests on GKE or AKS. The tradeoff is operational complexity - you are responsible for node groups, add-ons, RBAC policies, and cluster upgrades.

When ECS is the clear winner

  • Teams under 10 engineers with no Kubernetes experience
  • Startups that need to ship fast without infrastructure overhead
  • Workloads that are 100% AWS and will stay that way
  • Simple microservice architectures (under 20 services)
  • Cost-sensitive environments where $73/month per cluster matters

When EKS is the clear winner

  • Teams with existing Kubernetes expertise and tooling
  • Multi-cloud or hybrid-cloud requirements
  • Complex service mesh needs beyond what Service Connect provides
  • Heavy use of Helm charts and Kubernetes operators
  • Organizations with 50+ microservices and dedicated platform teams

Fargate Pricing 2026

Fargate pricing is per-second with a one-minute minimum. You pay for the vCPU and memory your task definition requests, not what the container actually uses. Getting your resource requests right is the single biggest lever for cost control.

All prices below are for US East (N. Virginia) as of May 2026. Other regions vary by 10-25%.

| Resource | On-Demand | Spot | Savings Plan (1yr) |
| --- | --- | --- | --- |
| vCPU per hour | $0.04048 | $0.01214 | $0.01943 |
| GB memory per hour | $0.004445 | $0.001334 | $0.002134 |
| Ephemeral storage (above 20 GB) | $0.000111/GB/hr | $0.000111/GB/hr | N/A |
| 1 vCPU / 2 GB task (monthly) | $35.74 | $10.72 | $17.16 |
| 2 vCPU / 4 GB task (monthly) | $71.48 | $21.44 | $34.31 |
| 4 vCPU / 8 GB task (monthly) | $142.96 | $42.89 | $68.62 |

Common pricing mistake: Fargate bills for the task definition's requested resources, not actual utilization. A task requesting 4 vCPU but using only 0.5 vCPU still pays the full 4 vCPU rate. Use Container Insights metrics to right-size your task definitions.

Fargate vs EC2 launch type cost comparison

For steady-state workloads running 24/7, EC2 launch type with Reserved Instances can be 40-60% cheaper than Fargate on-demand. But Fargate eliminates patching, AMI updates, and capacity planning. The operational savings often outweigh the compute premium, especially for teams without dedicated infrastructure engineers.

The break-even point depends on utilization. If your EC2 instances run above 70% CPU utilization consistently, EC2 launch type wins on cost. Below 50% utilization, Fargate's per-task billing usually wins because you are not paying for idle capacity.

ARM64 (Graviton) pricing advantage

Fargate tasks running on ARM64 (Graviton3) processors cost 20% less than x86 equivalents with comparable or better performance. A 1 vCPU / 2 GB ARM64 task costs $28.59/month versus $35.74 for x86. For most web services and API workloads, switching to ARM64 requires only changing the runtimePlatform in your task definition and rebuilding your container image for linux/arm64.
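
As a quick sanity check, the per-hour rates above can be turned into monthly estimates with a few lines of shell. This is a back-of-the-envelope sketch assuming a 730-hour month, so the results differ slightly from the table's monthly figures, which use a different hours assumption:

```shell
#!/usr/bin/env bash
# Rough monthly Fargate cost from the US East on-demand rates above.
vcpu_rate=0.04048     # $ per vCPU-hour (x86, on-demand)
mem_rate=0.004445     # $ per GB-hour  (x86, on-demand)
hours=730             # ~ one month

cost() {  # cost <vcpus> <gb> <multiplier>  (1.0 on-demand, 0.8 ARM64, 0.3 Spot)
  awk -v v="$1" -v m="$2" -v d="$3" -v vr="$vcpu_rate" -v mr="$mem_rate" -v h="$hours" \
    'BEGIN { printf "%.2f", (v * vr + m * mr) * h * d }'
}

echo "1 vCPU / 2 GB x86 on-demand: \$$(cost 1 2 1.0)"
echo "1 vCPU / 2 GB ARM64 (-20%):  \$$(cost 1 2 0.8)"
echo "1 vCPU / 2 GB Spot (-70%):   \$$(cost 1 2 0.3)"
```

Swapping in the Spot or Savings Plan hourly rates from the table gives the other columns.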

ECS Architecture Deep Dive

ECS has four core building blocks: clusters, task definitions, tasks, and services. Understanding how they fit together is essential before deploying anything to production.

Task definitions

A task definition is a blueprint for your application. It specifies container images, CPU/memory limits, port mappings, environment variables, logging configuration, and IAM roles. Think of it as a docker-compose file that AWS understands.

{
  "family": "api-service",
  "networkMode": "awsvpc",
  "requiresCompatibilities": ["FARGATE"],
  "cpu": "512",
  "memory": "1024",
  "runtimePlatform": {
    "cpuArchitecture": "ARM64",
    "operatingSystemFamily": "LINUX"
  },
  "executionRoleArn": "arn:aws:iam::123456789012:role/ecsTaskExecutionRole",
  "taskRoleArn": "arn:aws:iam::123456789012:role/api-service-task-role",
  "containerDefinitions": [
    {
      "name": "api",
      "image": "123456789012.dkr.ecr.us-east-1.amazonaws.com/api:v1.2.3",
      "portMappings": [{"containerPort": 8080, "protocol": "tcp"}],
      "logConfiguration": {
        "logDriver": "awslogs",
        "options": {
          "awslogs-group": "/ecs/api-service",
          "awslogs-region": "us-east-1",
          "awslogs-stream-prefix": "api"
        }
      },
      "secrets": [
        {"name": "DB_PASSWORD", "valueFrom": "arn:aws:secretsmanager:us-east-1:123456789012:secret:db-password"}
      ],
      "healthCheck": {
        "command": ["CMD-SHELL", "curl -f http://localhost:8080/health || exit 1"],
        "interval": 30,
        "timeout": 5,
        "retries": 3,
        "startPeriod": 60
      }
    }
  ]
}

Key details that trip people up:

  • networkMode must be awsvpc for Fargate. Each task gets its own ENI and private IP.
  • cpu and memory are strings at the task level, integers at the container level.
  • executionRoleArn is for ECS agent operations (pulling images, writing logs). taskRoleArn is for your application code (calling S3, DynamoDB, etc.).
  • Fargate supports specific CPU/memory combinations. You cannot request arbitrary values.
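
The last point can be made concrete with a small helper. This sketch covers only the common 0.25-4 vCPU sizes and checks ranges, not the exact 1 GB memory increments Fargate also enforces; the 8 and 16 vCPU sizes are omitted:

```shell
#!/usr/bin/env bash
# Rough validity check for Fargate CPU/memory pairs (CPU units / MiB).
valid_fargate_combo() {  # valid_fargate_combo <cpu-units> <memory-mib>
  case "$1" in
    256)  [ "$2" -ge 512 ]  && [ "$2" -le 2048 ]  ;;   # 0.25 vCPU: 0.5-2 GB
    512)  [ "$2" -ge 1024 ] && [ "$2" -le 4096 ]  ;;   # 0.5 vCPU:  1-4 GB
    1024) [ "$2" -ge 2048 ] && [ "$2" -le 8192 ]  ;;   # 1 vCPU:    2-8 GB
    2048) [ "$2" -ge 4096 ] && [ "$2" -le 16384 ] ;;   # 2 vCPU:    4-16 GB
    4096) [ "$2" -ge 8192 ] && [ "$2" -le 30720 ] ;;   # 4 vCPU:    8-30 GB
    *)    return 1 ;;                                  # not a Fargate CPU size
  esac
}

valid_fargate_combo 512 1024 && echo "512 CPU / 1 GB: supported"
valid_fargate_combo 512 8192 || echo "512 CPU / 8 GB: not supported"
```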

Services

An ECS service maintains a desired count of running tasks and integrates with load balancers, service discovery, and deployment controllers. It is the layer that makes your containers production-ready.

Services handle rolling updates, health check grace periods, circuit breakers, and capacity provider strategies. A well-configured service definition is the difference between a toy deployment and a production system.

Capacity providers

Capacity providers tell ECS where to place tasks. For Fargate, you have two built-in providers: FARGATE (on-demand) and FARGATE_SPOT. You can mix them with a strategy:

{
  "capacityProviderStrategy": [
    {"capacityProvider": "FARGATE", "weight": 1, "base": 2},
    {"capacityProvider": "FARGATE_SPOT", "weight": 3, "base": 0}
  ]
}

This configuration keeps a minimum of 2 on-demand tasks (the base) and distributes additional tasks 75% to Spot and 25% to on-demand. The base tasks guarantee availability even during Spot capacity shortages.
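
The split can be sketched numerically. This is an approximation for intuition only - ECS's exact rounding of weighted placement may differ:

```shell
#!/usr/bin/env bash
# Approximate task split for: FARGATE base=2 weight=1, FARGATE_SPOT weight=3.
allocate() {  # allocate <desired-count>
  local desired=$1 base=2 w_od=1 w_spot=3
  local extra=$(( desired > base ? desired - base : 0 ))
  local od_extra=$(( extra * w_od / (w_od + w_spot) ))
  local od=$(( (desired < base ? desired : base) + od_extra ))
  echo "desired=$desired on-demand=$od spot=$(( extra - od_extra ))"
}

allocate 10   # -> desired=10 on-demand=4 spot=6
```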

Networking with awsvpc

Every Fargate task gets its own elastic network interface (ENI) in your VPC. This means tasks have real private IPs, can be placed in private subnets, and are subject to security group rules. The downside is ENI density limits per subnet - plan your CIDR blocks accordingly. A /24 subnet supports roughly 250 tasks.

For public-facing services, place tasks in private subnets behind an Application Load Balancer in public subnets. Tasks should never have public IPs in production.

Production Deployment Patterns

ECS supports three deployment controllers: rolling update (default), blue/green via CodeDeploy, and external controllers. Each has tradeoffs between complexity, rollback speed, and blast radius.

Rolling updates

The default deployment type. ECS drains old tasks and starts new ones in batches controlled by minimumHealthyPercent and maximumPercent. For a service with 4 tasks:

  • minimumHealthyPercent: 50 - ECS can stop 2 tasks before starting replacements
  • maximumPercent: 200 - ECS can run up to 8 tasks during deployment

Rolling updates are simple and work well for most services. The downside is slow rollbacks - if the new version is broken, you must deploy the old version forward rather than instantly switching back.

Blue/green with CodeDeploy

Blue/green deployments run the new version alongside the old version and shift traffic atomically. ECS integrates with CodeDeploy to manage this through ALB target group switching.

# appspec.yaml for ECS blue/green
version: 0.0
Resources:
  - TargetService:
      Type: AWS::ECS::Service
      Properties:
        TaskDefinition: "arn:aws:ecs:us-east-1:123456789012:task-definition/api:42"
        LoadBalancerInfo:
          ContainerName: "api"
          ContainerPort: 8080
Hooks:
  - BeforeAllowTraffic: "LambdaValidationFunction"
  - AfterAllowTraffic: "LambdaSmokeTestFunction"

CodeDeploy supports three traffic shifting strategies:

| Strategy | Behavior | Best for |
| --- | --- | --- |
| AllAtOnce | 100% traffic shift immediately | Dev/staging environments |
| Linear10PercentEvery1Minute | 10% shift every minute | Low-risk production deploys |
| Canary10Percent5Minutes | 10% for 5 min, then 100% | High-risk changes |

The killer feature is instant rollback. If your validation Lambda or CloudWatch alarms detect problems, CodeDeploy shifts traffic back to the blue target group in seconds. No redeployment needed.

Circuit breakers

ECS deployment circuit breakers automatically roll back failed deployments without manual intervention. Enable them in your service definition:

{
  "deploymentConfiguration": {
    "deploymentCircuitBreaker": {
      "enable": true,
      "rollback": true
    },
    "minimumHealthyPercent": 100,
    "maximumPercent": 200
  }
}

The circuit breaker triggers when ECS cannot reach the desired task count after multiple attempts. It tracks the ratio of failed task launches to successful ones. If the failure threshold is breached, ECS automatically rolls back to the last stable deployment. This is a must-have for any production service.

Service discovery with Cloud Map

AWS Cloud Map provides DNS-based and API-based service discovery for ECS services. When a task starts, ECS registers it with Cloud Map. Other services resolve the name via private DNS.

# Tasks register as: api.production.local
# Other services connect via DNS
curl http://api.production.local:8080/health

For more advanced service-to-service communication, ECS Service Connect (GA since 2023) provides a built-in service mesh powered by Envoy. It handles load balancing, retries, timeouts, and observability between services without requiring you to manage Envoy configuration directly.

Service Connect is the recommended approach for new ECS deployments. It replaces the older App Mesh service and provides better integration with ECS task networking. See the Service Connect documentation for setup details.

ECS Exec - Debugging Running Containers

ECS Exec gives you interactive shell access to running Fargate containers using AWS Systems Manager (SSM). No SSH, no bastion hosts, no sidecar containers. It works by injecting the SSM agent into your task at runtime.

Enabling ECS Exec

# Enable on an existing service
aws ecs update-service \
  --cluster production \
  --service api-service \
  --enable-execute-command

# Start a new task with exec enabled
aws ecs run-task \
  --cluster production \
  --task-definition api-service:42 \
  --enable-execute-command \
  --network-configuration "awsvpcConfiguration={subnets=[subnet-abc123],securityGroups=[sg-abc123]}"

Connecting to a container

# Interactive shell
aws ecs execute-command \
  --cluster production \
  --task arn:aws:ecs:us-east-1:123456789012:task/production/abc123 \
  --container api \
  --interactive \
  --command "/bin/sh"

# Run a single command
aws ecs execute-command \
  --cluster production \
  --task arn:aws:ecs:us-east-1:123456789012:task/production/abc123 \
  --container api \
  --interactive \
  --command "cat /app/config.yaml"

Prerequisites

  • The task role (not the execution role) needs ssmmessages:CreateControlChannel, ssmmessages:CreateDataChannel, ssmmessages:OpenControlChannel, and ssmmessages:OpenDataChannel permissions
  • Install the Session Manager plugin on your local machine
  • The task must be in RUNNING state
  • Platform version 1.4.0 or later (current default is 1.4.0)

Security note: ECS Exec sessions are logged to CloudTrail. For production environments, also enable SSM session logging to S3 or CloudWatch Logs for full audit trails of every command executed inside containers.
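
The four ssmmessages permissions from the prerequisites, written out as an inline policy for the task role. A minimal sketch - these channel actions are commonly granted with a wildcard resource, since they do not target a specific ARN:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "ssmmessages:CreateControlChannel",
        "ssmmessages:CreateDataChannel",
        "ssmmessages:OpenControlChannel",
        "ssmmessages:OpenDataChannel"
      ],
      "Resource": "*"
    }
  ]
}
```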

Troubleshooting ECS Exec

The most common failure is the SSM agent not starting inside the container. Use the amazon-ecs-exec-checker script to diagnose:

# Install the checker
git clone https://github.com/aws-containers/amazon-ecs-exec-checker.git
cd amazon-ecs-exec-checker

# Run diagnostics
./check-ecs-exec.sh production abc123def456

Common issues: VPC endpoints missing for ssmmessages (required if tasks are in private subnets without NAT), task role missing SSM permissions, or the container running as a non-root user without the required capabilities.

Auto-Scaling Strategies

ECS uses Application Auto Scaling to adjust the desired count of tasks in a service. You can scale on CPU, memory, custom CloudWatch metrics, or ALB request count. The right strategy depends on your workload pattern.

Target tracking (recommended default)

Target tracking is the simplest and most effective scaling policy for most services. You set a target value and ECS adjusts task count to maintain it.

# Scale to maintain 60% average CPU utilization
aws application-autoscaling register-scalable-target \
  --service-namespace ecs \
  --resource-id service/production/api-service \
  --scalable-dimension ecs:service:DesiredCount \
  --min-capacity 2 \
  --max-capacity 20

aws application-autoscaling put-scaling-policy \
  --service-namespace ecs \
  --resource-id service/production/api-service \
  --scalable-dimension ecs:service:DesiredCount \
  --policy-name cpu-target-tracking \
  --policy-type TargetTrackingScaling \
  --target-tracking-scaling-policy-configuration '{
    "TargetValue": 60.0,
    "PredefinedMetricSpecification": {
      "PredefinedMetricType": "ECSServiceAverageCPUUtilization"
    },
    "ScaleInCooldown": 300,
    "ScaleOutCooldown": 60
  }'

Key settings to tune:

  • ScaleOutCooldown: 60 seconds - scale up fast when load increases
  • ScaleInCooldown: 300 seconds - scale down slowly to avoid flapping
  • Target value: 60-70% for CPU - leaves headroom for traffic spikes

ALB request count per target

For web services behind an ALB, scaling on request count per target is often more responsive than CPU-based scaling. It reacts to traffic increases before CPU utilization climbs.

{
  "TargetValue": 1000.0,
  "PredefinedMetricSpecification": {
    "PredefinedMetricType": "ALBRequestCountPerTarget",
    "ResourceLabel": "app/my-alb/abc123/targetgroup/my-tg/def456"
  }
}

Step scaling for bursty workloads

Step scaling lets you define different scaling actions at different alarm thresholds. Useful for workloads with sudden traffic spikes where you need aggressive scale-out:

  • CPU 60-75%: add 2 tasks
  • CPU 75-90%: add 4 tasks
  • CPU above 90%: add 8 tasks
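
Those steps can be sketched as a StepScalingPolicyConfiguration. The bounds are offsets relative to the triggering CloudWatch alarm's threshold (assumed here to be 60% CPU); the exact thresholds are illustrative:

```json
{
  "AdjustmentType": "ChangeInCapacity",
  "MetricAggregationType": "Average",
  "Cooldown": 60,
  "StepAdjustments": [
    {"MetricIntervalLowerBound": 0,  "MetricIntervalUpperBound": 15, "ScalingAdjustment": 2},
    {"MetricIntervalLowerBound": 15, "MetricIntervalUpperBound": 30, "ScalingAdjustment": 4},
    {"MetricIntervalLowerBound": 30, "ScalingAdjustment": 8}
  ]
}
```

Pass this as --step-scaling-policy-configuration to aws application-autoscaling put-scaling-policy with --policy-type StepScaling, and attach the policy to a CloudWatch alarm on the service's CPU metric.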

Scheduled scaling

If your traffic follows predictable patterns (business hours, batch processing windows), scheduled scaling pre-provisions capacity before the load arrives:

# Scale up for business hours (8 AM EST)
aws application-autoscaling put-scheduled-action \
  --service-namespace ecs \
  --resource-id service/production/api-service \
  --scalable-dimension ecs:service:DesiredCount \
  --scheduled-action-name scale-up-morning \
  --schedule "cron(0 13 ? * MON-FRI *)" \
  --scalable-target-action MinCapacity=6,MaxCapacity=20

# Scale down for evenings (8 PM EST)
aws application-autoscaling put-scheduled-action \
  --service-namespace ecs \
  --resource-id service/production/api-service \
  --scalable-dimension ecs:service:DesiredCount \
  --scheduled-action-name scale-down-evening \
  --schedule "cron(0 1 ? * TUE-SAT *)" \
  --scalable-target-action MinCapacity=2,MaxCapacity=10

Pro tip: Combine scheduled scaling with target tracking. Use scheduled scaling to set the floor (minimum capacity) and let target tracking handle dynamic adjustments above that floor. This gives you both predictable baseline capacity and reactive scaling.

Observability - Container Insights, X-Ray, and FireLens

Running containers without observability is flying blind. ECS integrates with three AWS services that together provide metrics, traces, and structured logs for your production workloads. For a deeper dive into observability tooling, see our Observability Stack guide.

Container Insights

Container Insights collects CPU, memory, network, and storage metrics at the task and service level. The enhanced observability mode (launched 2024) adds container-level metrics and automatic dashboard generation.

# Enable Container Insights on a cluster
aws ecs update-cluster-settings \
  --cluster production \
  --settings name=containerInsights,value=enhanced

Enhanced Container Insights provides:

  • Per-container CPU and memory utilization (not just per-task)
  • Network bytes in/out per container
  • Storage read/write operations
  • Automatic CloudWatch dashboards with service maps
  • Performance anomaly detection with ML-based alerts

The cost is $0.01 per Container Insights metric per month. For a cluster with 20 services, expect roughly $15-25/month in Container Insights charges. Worth every penny compared to debugging blind.

AWS X-Ray for distributed tracing

X-Ray traces requests across your microservices, showing latency breakdowns, error rates, and dependency maps. For ECS Fargate, deploy the X-Ray daemon as a sidecar container:

{
  "name": "xray-daemon",
  "image": "public.ecr.aws/xray/aws-xray-daemon:latest",
  "cpu": 32,
  "memoryReservation": 64,
  "portMappings": [{"containerPort": 2000, "protocol": "udp"}],
  "logConfiguration": {
    "logDriver": "awslogs",
    "options": {
      "awslogs-group": "/ecs/xray-daemon",
      "awslogs-region": "us-east-1",
      "awslogs-stream-prefix": "xray"
    }
  }
}

Your application sends trace segments to localhost:2000 via UDP. The X-Ray SDK is available for Java, Python, Node.js, Go, .NET, and Ruby. For languages without an SDK, use the OpenTelemetry collector with the X-Ray exporter instead.

FireLens for log routing

FireLens is an ECS-native log router built on Fluent Bit. Instead of sending all logs to CloudWatch Logs (expensive at scale), FireLens lets you route logs to S3, Elasticsearch, Datadog, Splunk, or any Fluent Bit output plugin.

{
  "name": "log-router",
  "image": "public.ecr.aws/aws-observability/aws-for-fluent-bit:stable",
  "firelensConfiguration": {
    "type": "fluentbit",
    "options": {
      "config-file-type": "file",
      "config-file-value": "/fluent-bit/configs/output.conf"
    }
  },
  "essential": true,
  "cpu": 64,
  "memoryReservation": 128
}

A common production pattern routes logs to three destinations simultaneously:

  • CloudWatch Logs - for real-time debugging and CloudWatch Logs Insights queries
  • S3 - for long-term retention and compliance (much cheaper than CloudWatch)
  • OpenSearch - for full-text search and dashboards

Cost savings: Routing logs to S3 instead of CloudWatch Logs can reduce logging costs by 80-90%. CloudWatch Logs ingestion costs $0.50/GB. S3 Standard storage costs $0.023/GB/month. For a service producing 100 GB/month of logs, that is $50/month vs $2.30/month.

Security Best Practices

Container security on ECS spans IAM roles, secrets management, network isolation, and image scanning. Getting these right from day one prevents the kind of security incidents that wake you up at 3 AM. For a broader view, see our Cloud Security guide.

Task roles vs execution roles

This is the most commonly confused aspect of ECS security. There are two distinct IAM roles:

  • Execution role - used by the ECS agent to pull container images from ECR, write logs to CloudWatch, and retrieve secrets. This role is never available to your application code.
  • Task role - assumed by your application code at runtime. If your API needs to read from S3 or write to DynamoDB, those permissions go on the task role.

An example task-role policy:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "s3:GetObject",
        "s3:PutObject"
      ],
      "Resource": "arn:aws:s3:::my-app-bucket/*"
    },
    {
      "Effect": "Allow",
      "Action": [
        "dynamodb:GetItem",
        "dynamodb:PutItem",
        "dynamodb:Query"
      ],
      "Resource": "arn:aws:dynamodb:us-east-1:123456789012:table/my-app-table"
    }
  ]
}

Follow the principle of least privilege. Each service should have its own task role with only the permissions it needs. Never share task roles across services.

Secrets management

Never bake secrets into container images or pass them as plain-text environment variables. ECS integrates with both AWS Secrets Manager and SSM Parameter Store to inject secrets at task startup:

{
  "secrets": [
    {
      "name": "DB_PASSWORD",
      "valueFrom": "arn:aws:secretsmanager:us-east-1:123456789012:secret:prod/db-password"
    },
    {
      "name": "API_KEY",
      "valueFrom": "arn:aws:ssm:us-east-1:123456789012:parameter/prod/api-key"
    }
  ]
}

Secrets Manager costs $0.40/secret/month plus $0.05 per 10,000 API calls. SSM Parameter Store SecureString parameters are free for standard throughput. Use Secrets Manager for credentials that need automatic rotation. Use SSM Parameter Store for everything else.

VPC and network security

  • Private subnets only - Fargate tasks should never have public IPs. Use a NAT Gateway or VPC endpoints for outbound access.
  • Security groups per service - each ECS service should have its own security group. Allow only the ports and sources that service needs.
  • VPC endpoints - for private subnets, create endpoints for ECR (ecr.api, ecr.dkr), S3 (gateway), CloudWatch Logs, Secrets Manager, and SSM. This keeps traffic off the public internet and avoids NAT Gateway data processing charges.
  • Network ACLs - use as a secondary defense layer. Keep them simple - deny known bad CIDR ranges and allow everything else at the NACL level. Let security groups handle fine-grained access control.

Image security

  • Enable ECR image scanning on push. It checks for CVEs in OS packages and language dependencies.
  • Use minimal base images (distroless, Alpine, or scratch) to reduce attack surface.
  • Pin image tags to immutable digests in production: image: 123456789012.dkr.ecr.us-east-1.amazonaws.com/api@sha256:abc123...
  • Enable ECR lifecycle policies to automatically clean up untagged and old images.
  • Run containers as a non-root user. Add USER 1000 to your Dockerfile.

Do not use the :latest tag in production. It is mutable and can point to different images over time. Always use specific version tags or SHA digests. A deployment that worked yesterday can break today if someone pushes a new :latest image.

CI/CD with GitHub Actions

A production ECS pipeline builds the container image, pushes it to ECR, updates the task definition, and deploys the new service version. GitHub Actions has official AWS actions that make this straightforward.

Complete workflow

# .github/workflows/deploy.yml
name: Deploy to ECS
on:
  push:
    branches: [main]

permissions:
  id-token: write
  contents: read

env:
  AWS_REGION: us-east-1
  ECR_REPOSITORY: api-service
  ECS_CLUSTER: production
  ECS_SERVICE: api-service
  TASK_DEFINITION: .aws/task-definition.json
  CONTAINER_NAME: api

jobs:
  deploy:
    runs-on: ubuntu-latest
    steps:
      - name: Checkout
        uses: actions/checkout@v4

      - name: Configure AWS credentials
        uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: arn:aws:iam::123456789012:role/github-actions-deploy
          aws-region: ${{ env.AWS_REGION }}

      - name: Login to ECR
        id: ecr-login
        uses: aws-actions/amazon-ecr-login@v2

      - name: Build and push image
        id: build
        env:
          ECR_REGISTRY: ${{ steps.ecr-login.outputs.registry }}
          IMAGE_TAG: ${{ github.sha }}
        run: |
          docker build \
            --platform linux/arm64 \
            -t $ECR_REGISTRY/$ECR_REPOSITORY:$IMAGE_TAG \
            -t $ECR_REGISTRY/$ECR_REPOSITORY:latest \
            .
          docker push $ECR_REGISTRY/$ECR_REPOSITORY:$IMAGE_TAG
          docker push $ECR_REGISTRY/$ECR_REPOSITORY:latest
          echo "image=$ECR_REGISTRY/$ECR_REPOSITORY:$IMAGE_TAG" >> $GITHUB_OUTPUT

      - name: Update task definition
        id: task-def
        uses: aws-actions/amazon-ecs-render-task-definition@v1
        with:
          task-definition: ${{ env.TASK_DEFINITION }}
          container-name: ${{ env.CONTAINER_NAME }}
          image: ${{ steps.build.outputs.image }}

      - name: Deploy to ECS
        uses: aws-actions/amazon-ecs-deploy-task-definition@v2
        with:
          task-definition: ${{ steps.task-def.outputs.task-definition }}
          service: ${{ env.ECS_SERVICE }}
          cluster: ${{ env.ECS_CLUSTER }}
          wait-for-service-stability: true
          wait-for-minutes: 10

Key details

  • OIDC authentication - use role-to-assume instead of storing AWS access keys as secrets. OIDC is more secure and eliminates key rotation.
  • ARM64 builds - the --platform linux/arm64 flag builds for Graviton. If your GitHub runner is x86, enable Docker BuildKit with QEMU emulation or use a self-hosted ARM runner.
  • wait-for-service-stability - the deploy step waits until the new tasks pass health checks and the old tasks drain. If the deployment fails, the step fails and you get a clear error in your PR.
  • Image tagging - use the git SHA as the primary tag for traceability. Push :latest as a convenience tag but never reference it in task definitions.
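
For the QEMU emulation route mentioned above, one common approach is to add two setup steps before the build step. A sketch using the Docker-maintained setup actions; pin the versions your repo standardizes on:

```yaml
      - name: Set up QEMU
        uses: docker/setup-qemu-action@v3

      - name: Set up Docker Buildx
        uses: docker/setup-buildx-action@v3
```

With Buildx configured, the existing docker build --platform linux/arm64 step can also be switched to docker buildx build for multi-arch pushes.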

Blue/green variant

For CodeDeploy blue/green deployments, replace the final step with:

      - name: Deploy (blue/green)
        uses: aws-actions/amazon-ecs-deploy-task-definition@v2
        with:
          task-definition: ${{ steps.task-def.outputs.task-definition }}
          service: ${{ env.ECS_SERVICE }}
          cluster: ${{ env.ECS_CLUSTER }}
          codedeploy-appspec: .aws/appspec.yaml
          codedeploy-application: api-service
          codedeploy-deployment-group: api-service-prod

This triggers a CodeDeploy deployment instead of a rolling update, giving you canary analysis and instant rollback capabilities.

Cost Optimization

Container costs on AWS can spiral quickly if you are not deliberate about optimization. The three biggest levers are Fargate Spot, ARM64/Graviton, and Compute Savings Plans. Combined, they can reduce your bill by 70-80%. For a comprehensive approach, see our AWS Cost Optimization guide.

Fargate Spot - 70% savings

Fargate Spot uses spare AWS capacity at a 70% discount. Tasks can be interrupted with a 30-second SIGTERM warning when AWS needs the capacity back. Use Spot for:

  • Queue workers and background job processors
  • Batch data processing pipelines
  • Development and staging environments
  • Stateless web services with multiple replicas (behind a load balancer)

Do not use Spot for singleton tasks, stateful workloads, or anything that cannot tolerate a 30-second shutdown. The capacity provider strategy shown earlier (base of 2 on-demand, rest on Spot) is the recommended pattern for production web services.
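
Whatever you do run on Spot should treat SIGTERM as a drain signal rather than dying mid-job. A minimal sketch of that pattern (the worker function and timings are illustrative; ECS also honors the task's stopTimeout before sending SIGKILL):

```shell
#!/usr/bin/env bash
# Worker loop that finishes the current unit of work when SIGTERM arrives.
# Fargate Spot gives roughly 30 seconds between SIGTERM and SIGKILL.
worker() {
  local shutdown=0
  trap 'shutdown=1' TERM
  while [ "$shutdown" -eq 0 ]; do
    sleep 0.1   # placeholder for: pull and process one queue message
  done
  echo "drained"   # flush buffers, ack in-flight work, then exit cleanly
}

worker & pid=$!
sleep 0.3
kill -TERM "$pid"   # simulate the Spot interruption signal
wait "$pid"
```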

ARM64/Graviton - 20% savings

AWS Graviton3 processors deliver 20% lower cost and up to 40% better price-performance compared to x86 for most workloads. Switching requires two changes:

  1. Update your task definition's runtimePlatform.cpuArchitecture to ARM64
  2. Build your container image for linux/arm64

# Multi-arch Dockerfile
FROM --platform=$TARGETPLATFORM node:20-alpine
WORKDIR /app
COPY package*.json ./
RUN npm ci --production
COPY . .
USER 1000
EXPOSE 8080
CMD ["node", "server.js"]

# Build multi-arch image with Docker BuildKit
docker buildx build \
  --platform linux/amd64,linux/arm64 \
  -t 123456789012.dkr.ecr.us-east-1.amazonaws.com/api:v1.2.3 \
  --push .

Most Node.js, Python, Go, and Java applications work on ARM64 without code changes. Watch out for native binary dependencies (some npm packages with C extensions, Python packages with compiled wheels). Test thoroughly before switching production.

Compute Savings Plans - up to 52% savings

Compute Savings Plans commit to a consistent amount of compute spend (measured in $/hour) for 1 or 3 years. They apply automatically to Fargate, Lambda, and EC2 usage across all regions.

| Plan type | Term | Payment | Discount |
| --- | --- | --- | --- |
| Compute Savings Plan | 1 year | All upfront | 37% |
| Compute Savings Plan | 1 year | No upfront | 29% |
| Compute Savings Plan | 3 year | All upfront | 52% |
| Compute Savings Plan | 3 year | No upfront | 43% |

Start by analyzing your last 30 days of Fargate usage in Cost Explorer. Look at the Savings Plans recommendations page - it calculates the optimal commitment level based on your actual usage patterns. Commit to 70-80% of your baseline usage and let the rest run on-demand for flexibility.

Right-sizing task definitions

Over-provisioned task definitions are the silent budget killer. Use Container Insights metrics to compare requested vs actual CPU and memory usage:

# CloudWatch Logs Insights query for right-sizing, run against the
# log group /aws/ecs/containerinsights/production/performance
filter Type = "Task" and ServiceName = "api-service"
| stats avg(CpuUtilized) as avg_cpu,
        max(CpuUtilized) as max_cpu,
        avg(MemoryUtilized) as avg_mem_mb,
        max(MemoryUtilized) as max_mem_mb
  by bin(1h)
If your task requests 1024 CPU units but peaks at 400, drop it to 512. If memory peaks at 600 MB but you allocated 2048 MB, drop to 1024 MB. Each reduction directly lowers your per-second Fargate bill.
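
That rule of thumb can be mechanized. A sketch that maps observed peaks to the smallest common Fargate size with roughly 25% headroom - illustrative only, and the resulting pair must still be a valid CPU/memory combination:

```shell
#!/usr/bin/env bash
# Suggest a Fargate size from peak usage: add ~25% headroom, then round up
# to the next standard CPU (units) and memory (MiB) step.
suggest() {  # suggest <peak-cpu-units> <peak-mem-mib>
  local cpu mem need_cpu=$(( $1 * 5 / 4 )) need_mem=$(( $2 * 5 / 4 ))
  for cpu in 256 512 1024 2048 4096; do
    [ "$cpu" -ge "$need_cpu" ] && break
  done
  for mem in 512 1024 2048 4096 8192 16384 30720; do
    [ "$mem" -ge "$need_mem" ] && break
  done
  echo "cpu=$cpu memory=$mem"
}

suggest 400 600   # peaks of 400 CPU units / 600 MiB -> cpu=512 memory=1024
```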

Combined savings example

| Optimization | Monthly cost (10 tasks, 1 vCPU / 2 GB) | Savings |
| --- | --- | --- |
| Baseline (x86, on-demand) | $357.40 | - |
| + ARM64 | $285.92 | 20% |
| + Spot (70% of tasks) | $160.73 | 55% |
| + Savings Plan (on-demand portion) | $118.14 | 67% |
| + Right-sizing (0.5 vCPU / 1 GB) | $59.07 | 83% |

From $357/month to $59/month for the same workload. That is an 83% reduction by stacking four optimization strategies.

Complete Terraform Example

Here is a production-ready Terraform configuration that ties together everything covered in this guide: Fargate with Spot capacity providers, ALB, auto-scaling, Container Insights, security groups, and secrets. For more on infrastructure as code, see our IaC Guide.

# providers.tf
terraform {
  required_version = ">= 1.8"
  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.50"
    }
  }
  backend "s3" {
    bucket = "my-terraform-state"
    key    = "ecs/production/terraform.tfstate"
    region = "us-east-1"
  }
}

provider "aws" {
  region = "us-east-1"
}

# variables.tf
variable "app_name" {
  type    = string
  default = "api-service"
}

variable "environment" {
  type    = string
  default = "production"
}

variable "container_image" {
  description = "Full ECR image URI with tag"
  type        = string
}

variable "vpc_id" {
  type = string
}

variable "private_subnet_ids" {
  type = list(string)
}

variable "public_subnet_ids" {
  type = list(string)
}

variable "certificate_arn" {
  description = "ACM certificate ARN for the HTTPS listener"
  type        = string
}

# ecs-cluster.tf
resource "aws_ecs_cluster" "main" {
  name = var.environment

  setting {
    name  = "containerInsights"
    value = "enhanced"
  }
}

resource "aws_ecs_cluster_capacity_providers" "main" {
  cluster_name       = aws_ecs_cluster.main.name
  capacity_providers = ["FARGATE", "FARGATE_SPOT"]

  # base = 2 pins the first two tasks to reliable on-demand capacity;
  # every task beyond that splits 1:3 between on-demand and Spot.
  default_capacity_provider_strategy {
    capacity_provider = "FARGATE"
    weight            = 1
    base              = 2
  }

  default_capacity_provider_strategy {
    capacity_provider = "FARGATE_SPOT"
    weight            = 3
  }
}

# iam.tf
resource "aws_iam_role" "execution" {
  name = "${var.app_name}-execution"
  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Action    = "sts:AssumeRole"
      Effect    = "Allow"
      Principal = { Service = "ecs-tasks.amazonaws.com" }
    }]
  })
}

resource "aws_iam_role_policy_attachment" "execution" {
  role       = aws_iam_role.execution.name
  policy_arn = "arn:aws:iam::aws:policy/service-role/AmazonECSTaskExecutionRolePolicy"
}

resource "aws_iam_role_policy" "execution_secrets" {
  name = "secrets-access"
  role = aws_iam_role.execution.id
  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect   = "Allow"
      Action   = ["secretsmanager:GetSecretValue"]
      Resource = ["arn:aws:secretsmanager:us-east-1:*:secret:${var.environment}/*"]
    }]
  })
}

resource "aws_iam_role" "task" {
  name = "${var.app_name}-task"
  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Action    = "sts:AssumeRole"
      Effect    = "Allow"
      Principal = { Service = "ecs-tasks.amazonaws.com" }
    }]
  })
}

# Add SSM permissions for ECS Exec
resource "aws_iam_role_policy" "task_exec" {
  name = "ecs-exec"
  role = aws_iam_role.task.id
  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect = "Allow"
      Action = [
        "ssmmessages:CreateControlChannel",
        "ssmmessages:CreateDataChannel",
        "ssmmessages:OpenControlChannel",
        "ssmmessages:OpenDataChannel"
      ]
      Resource = "*"
    }]
  })
}

# alb.tf
resource "aws_security_group" "alb" {
  name_prefix = "${var.app_name}-alb-"
  vpc_id      = var.vpc_id

  ingress {
    from_port   = 443
    to_port     = 443
    protocol    = "tcp"
    cidr_blocks = ["0.0.0.0/0"]
  }

  egress {
    from_port   = 0
    to_port     = 0
    protocol    = "-1"
    cidr_blocks = ["0.0.0.0/0"]
  }
}

resource "aws_lb" "main" {
  name               = var.app_name
  internal           = false
  load_balancer_type = "application"
  security_groups    = [aws_security_group.alb.id]
  subnets            = var.public_subnet_ids
}

resource "aws_lb_target_group" "main" {
  name        = var.app_name
  port        = 8080
  protocol    = "HTTP"
  vpc_id      = var.vpc_id
  target_type = "ip"

  health_check {
    path                = "/health"
    healthy_threshold   = 2
    unhealthy_threshold = 3
    timeout             = 5
    interval            = 30
    matcher             = "200"
  }

  deregistration_delay = 30
}

resource "aws_lb_listener" "https" {
  load_balancer_arn = aws_lb.main.arn
  port              = 443
  protocol          = "HTTPS"
  ssl_policy        = "ELBSecurityPolicy-TLS13-1-2-2021-06"
  certificate_arn   = var.certificate_arn

  default_action {
    type             = "forward"
    target_group_arn = aws_lb_target_group.main.arn
  }
}
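
The listener above terminates TLS only. If clients may arrive on plain HTTP, a redirect listener is a common companion. This is an optional sketch; note that the ALB security group in this example only opens 443, so you would add a matching port 80 ingress rule as well:

```hcl
# Optional: redirect plain HTTP to HTTPS with a permanent 301.
resource "aws_lb_listener" "http_redirect" {
  load_balancer_arn = aws_lb.main.arn
  port              = 80
  protocol          = "HTTP"

  default_action {
    type = "redirect"
    redirect {
      port        = "443"
      protocol    = "HTTPS"
      status_code = "HTTP_301"
    }
  }
}
```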

# task-definition.tf
resource "aws_cloudwatch_log_group" "app" {
  name              = "/ecs/${var.app_name}"
  retention_in_days = 30
}

resource "aws_ecs_task_definition" "main" {
  family                   = var.app_name
  network_mode             = "awsvpc"
  requires_compatibilities = ["FARGATE"]
  cpu                      = "512"
  memory                   = "1024"
  execution_role_arn       = aws_iam_role.execution.arn
  task_role_arn            = aws_iam_role.task.arn

  runtime_platform {
    cpu_architecture        = "ARM64"
    operating_system_family = "LINUX"
  }

  container_definitions = jsonencode([{
    name  = "api"
    image = var.container_image
    portMappings = [{
      containerPort = 8080
      protocol      = "tcp"
    }]
    logConfiguration = {
      logDriver = "awslogs"
      options = {
        "awslogs-group"         = aws_cloudwatch_log_group.app.name
        "awslogs-region"        = "us-east-1"
        "awslogs-stream-prefix" = "api"
      }
    }
    secrets = [
      {
        name      = "DB_PASSWORD"
        valueFrom = "arn:aws:secretsmanager:us-east-1:123456789012:secret:${var.environment}/db-password"
      }
    ]
    healthCheck = {
      command     = ["CMD-SHELL", "curl -f http://localhost:8080/health || exit 1"]
      interval    = 30
      timeout     = 5
      retries     = 3
      startPeriod = 60
    }
  }])
}

# service.tf
resource "aws_security_group" "ecs" {
  name_prefix = "${var.app_name}-ecs-"
  vpc_id      = var.vpc_id

  ingress {
    from_port       = 8080
    to_port         = 8080
    protocol        = "tcp"
    security_groups = [aws_security_group.alb.id]
  }

  egress {
    from_port   = 0
    to_port     = 0
    protocol    = "-1"
    cidr_blocks = ["0.0.0.0/0"]
  }
}

resource "aws_ecs_service" "main" {
  name                   = var.app_name
  cluster                = aws_ecs_cluster.main.id
  task_definition        = aws_ecs_task_definition.main.arn
  desired_count          = 3
  enable_execute_command = true

  capacity_provider_strategy {
    capacity_provider = "FARGATE"
    weight            = 1
    base              = 2
  }

  capacity_provider_strategy {
    capacity_provider = "FARGATE_SPOT"
    weight            = 3
  }

  network_configuration {
    subnets         = var.private_subnet_ids
    security_groups = [aws_security_group.ecs.id]
  }

  load_balancer {
    target_group_arn = aws_lb_target_group.main.arn
    container_name   = "api"
    container_port   = 8080
  }

  # Note: these are top-level arguments in the Terraform AWS provider,
  # not a deployment_configuration block (that is the CloudFormation name).
  deployment_minimum_healthy_percent = 100
  deployment_maximum_percent         = 200

  deployment_circuit_breaker {
    enable   = true
    rollback = true
  }

  lifecycle {
    ignore_changes = [desired_count]
  }
}

# autoscaling.tf
resource "aws_appautoscaling_target" "ecs" {
  max_capacity       = 20
  min_capacity       = 2
  resource_id        = "service/${aws_ecs_cluster.main.name}/${aws_ecs_service.main.name}"
  scalable_dimension = "ecs:service:DesiredCount"
  service_namespace  = "ecs"
}

resource "aws_appautoscaling_policy" "cpu" {
  name               = "${var.app_name}-cpu"
  policy_type        = "TargetTrackingScaling"
  resource_id        = aws_appautoscaling_target.ecs.resource_id
  scalable_dimension = aws_appautoscaling_target.ecs.scalable_dimension
  service_namespace  = aws_appautoscaling_target.ecs.service_namespace

  target_tracking_scaling_policy_configuration {
    predefined_metric_specification {
      predefined_metric_type = "ECSServiceAverageCPUUtilization"
    }
    target_value       = 60
    scale_in_cooldown  = 300
    scale_out_cooldown = 60
  }
}
Production checklist: This Terraform example covers the core infrastructure. For a complete production setup, also add: WAF rules on the ALB, Route 53 DNS records, ACM certificate, VPC endpoints for ECR/S3/CloudWatch/SSM, CloudWatch alarms for 5xx errors and latency, and SNS notifications for deployment events.
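
As a starting point for the alarm items on that checklist, here is a sketch of a target 5xx alarm wired to the ALB and target group defined above. The `aws_sns_topic.alerts` resource is an assumption, not defined in this guide:

```hcl
# Alarm when the service returns more than 10 5xx responses
# in two consecutive 1-minute periods.
resource "aws_cloudwatch_metric_alarm" "target_5xx" {
  alarm_name          = "${var.app_name}-target-5xx"
  namespace           = "AWS/ApplicationELB"
  metric_name         = "HTTPCode_Target_5XX_Count"
  statistic           = "Sum"
  period              = 60
  evaluation_periods  = 2
  threshold           = 10
  comparison_operator = "GreaterThanThreshold"
  treat_missing_data  = "notBreaching" # no traffic should not page anyone

  dimensions = {
    LoadBalancer = aws_lb.main.arn_suffix
    TargetGroup  = aws_lb_target_group.main.arn_suffix
  }

  # Assumes an aws_sns_topic.alerts resource exists elsewhere in the config
  alarm_actions = [aws_sns_topic.alerts.arn]
}
```

A similar alarm on `TargetResponseTime` (statistic `Average` or a p99 extended statistic) covers the latency half of the checklist item.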