Introduction
Recently, I worked with a startup that was bleeding money on their EKS infrastructure. Their monthly AWS bill for EKS alone was hitting $15,000. Through careful analysis and implementation of several optimization strategies, we managed to cut their costs by 65% while improving overall cluster efficiency. Here's the complete breakdown of how we achieved this.
Initial Assessment
The Problem State
50 nodes running in the EKS cluster
Average node utilization: 20%
Poor pod distribution across nodes
Over-provisioned persistent volumes
No auto-scaling strategy
All nodes running on on-demand pricing
Infrastructure Analysis Steps
1. Used kubectl top nodes to gather utilization metrics
2. Implemented metrics-server for detailed resource tracking
3. Used kube-resource-report to visualize cluster resource allocation
4. Created baseline cost allocation using AWS Cost Explorer with Kubernetes tags
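To make step 1 concrete, here is a small sketch of how raw `kubectl top nodes` output can be summarized with awk; the node names and numbers below are made up for the example, chosen to mirror the ~20% utilization we found:

```shell
# Sample `kubectl top nodes` output saved to a file (values are illustrative):
cat <<'EOF' > /tmp/top-nodes.txt
NAME     CPU(cores)   CPU%   MEMORY(bytes)   MEMORY%
node-1   380m         19%    2900Mi          22%
node-2   410m         20%    3100Mi          24%
node-3   420m         21%    3000Mi          23%
EOF

# Average CPU% across nodes: skip the header row, strip the % sign from column 3
awk 'NR > 1 { gsub("%", "", $3); sum += $3; n++ } END { printf "avg CPU: %.1f%%\n", sum / n }' /tmp/top-nodes.txt
```

Running the same one-liner against live `kubectl top nodes` output gives a quick utilization baseline before any changes are made.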
Solution Implementation
1. Implementing Karpenter for Intelligent Node Provisioning
# karpenter.yaml
apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
metadata:
  name: default
spec:
  requirements:
    - key: "node.kubernetes.io/instance-type"
      operator: In
      values: ["m5.large", "m5.xlarge", "m5.2xlarge"]
  limits:
    resources:
      cpu: "1000"
      memory: 1000Gi
  providerRef:
    name: default
  ttlSecondsUntilExpired: 2592000
Implementation steps:
1. Install Karpenter using Helm:
helm repo add karpenter https://charts.karpenter.sh
helm repo update
helm install karpenter karpenter/karpenter --namespace karpenter \
--create-namespace --set serviceAccount.create=true \
--set serviceAccount.annotations."eks\.amazonaws\.com/role-arn"=${KARPENTER_IAM_ROLE_ARN}
2. Configure node templates
3. Set up scaling metrics
4. Implement fallback strategies
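For the scale-down side of steps 3–4, one option is to let Karpenter reclaim capacity on its own; `ttlSecondsAfterEmpty` is a real v1alpha5 field, but the 30-second value here is only an illustration, not the exact configuration we ran:

```yaml
apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
metadata:
  name: default
spec:
  # Deprovision nodes that have held no non-daemonset pods for 30s
  ttlSecondsAfterEmpty: 30
  # Recycle nodes after 30 days, matching ttlSecondsUntilExpired above
  ttlSecondsUntilExpired: 2592000
```

Shorter TTLs reclaim cost faster but churn nodes more aggressively, so tune the value against your workload's tolerance for rescheduling.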
2. Pod Topology Spread Constraints
# deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: sample-app
spec:
  template:
    spec:
      topologySpreadConstraints:
        - maxSkew: 1
          topologyKey: topology.kubernetes.io/zone
          whenUnsatisfiable: DoNotSchedule
          labelSelector:
            matchLabels:
              app: sample-app
Implementation steps:
1. Define topology keys
2. Set up spread constraints
3. Configure maxSkew values
4. Test pod distribution
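As a sanity check for step 4, the skew can be computed from a pod-to-zone listing; the placements below are hypothetical (in practice you would join `kubectl get pods -o wide` with each node's topology.kubernetes.io/zone label):

```shell
# Hypothetical pod -> zone placements:
cat <<'EOF' > /tmp/pod-zones.txt
sample-app-1 us-east-1a
sample-app-2 us-east-1b
sample-app-3 us-east-1a
sample-app-4 us-east-1c
EOF

# Skew = pods in the fullest zone minus pods in the emptiest zone;
# the scheduler keeps this at or below maxSkew (1 in the deployment above)
awk '{ c[$2]++ } END { for (z in c) { if (c[z] > max) max = c[z]; if (min == 0 || c[z] < min) min = c[z] } print "skew:", max - min }' /tmp/pod-zones.txt
```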
3. Spot Instance Integration
# spot-config.yaml
apiVersion: karpenter.k8s.aws/v1alpha1
kind: AWSNodeTemplate
metadata:
  name: spot-template
spec:
  subnetSelector:
    karpenter.sh/discovery: ${CLUSTER_NAME}
  securityGroupSelector:
    karpenter.sh/discovery: ${CLUSTER_NAME}
  instanceProfile: ${INSTANCE_PROFILE}
---
# Instance types and capacity type are Provisioner requirements, not
# AWSNodeTemplate fields. Allowing both "spot" and "on-demand" lets
# Karpenter prefer spot and fall back to on-demand when spot is unavailable.
apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
metadata:
  name: spot
spec:
  providerRef:
    name: spot-template
  requirements:
    - key: "karpenter.sh/capacity-type"
      operator: In
      values: ["spot", "on-demand"]
Implementation steps:
1. Create spot instance configuration
2. Set up interruption handling
3. Configure instance diversity
4. Implement fallback mechanisms
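On top of interruption handling (step 2), capping how many replicas can be drained at once softens the impact of spot reclaims; below is a minimal PodDisruptionBudget for the sample-app from earlier, with an illustrative threshold:

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: sample-app-pdb
spec:
  # Keep at least 2 replicas running while spot nodes are drained
  minAvailable: 2
  selector:
    matchLabels:
      app: sample-app
```

Pair this with instance-type diversity (step 3) so a single spot pool reclaim never takes out most of a workload at once.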
4. Persistent Volume Optimization
# storage-class.yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: gp3-storage
# gp3 requires the EBS CSI driver; the in-tree kubernetes.io/aws-ebs
# provisioner does not support it
provisioner: ebs.csi.aws.com
parameters:
  type: gp3
  iops: "3000"
  throughput: "125"
Implementation steps:
1. Audit existing PV usage
2. Migrate to gp3 volumes
3. Implement dynamic provisioning
4. Set up volume snapshots
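With the gp3 class in place, dynamic provisioning (step 3) is just a matter of claims referencing it; the claim below is a hypothetical example, with a size chosen for illustration:

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: app-data
spec:
  accessModes: ["ReadWriteOnce"]
  # References the gp3-storage StorageClass defined above
  storageClassName: gp3-storage
  resources:
    requests:
      storage: 20Gi
```

Right-sizing requests like this one during the audit (step 1) is where most of the PV savings came from.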
Results and Monitoring
Key Metrics Achieved
Node count reduction: 50 → 18
Average node utilization: 20% → 75%
Monthly cost savings: $9,750
Additional 15% savings from PV optimization
Monitoring Setup
1. Implemented Prometheus for metrics collection:
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm install prometheus prometheus-community/kube-prometheus-stack
2. Created Grafana dashboards for:
Node utilization
Pod distribution
Cost allocation
Spot instance savings
Best Practices and Lessons Learned
1. Regular Cost Reviews
Set up weekly cost analysis meetings
Review Cost Explorer data
Track spot savings
2. Maintenance Procedures
Regular cluster version updates
Node rotation schedule
Backup verification
3. Alert Setup
Node utilization thresholds
Spot instance interruption
Cost anomalies
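The utilization alert above can be expressed as a PrometheusRule, assuming the CRDs installed by the kube-prometheus-stack chart from the monitoring setup; the thresholds and durations here are illustrative, not the exact values we used:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: node-utilization
spec:
  groups:
    - name: cost
      rules:
        - alert: NodeCPUUnderutilized
          # Average CPU busy % per node, from node_exporter metrics
          expr: (1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[30m]))) * 100 < 30
          for: 2h
          labels:
            severity: info
          annotations:
            summary: "Node {{ $labels.instance }} CPU below 30% for 2h"
```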
Conclusion
Through systematic implementation of these optimizations, we achieved significant cost savings while improving cluster efficiency. The key was not just implementing individual solutions, but ensuring they worked together cohesively.
Remember that cost optimization is an ongoing process, not a one-time task. Regular monitoring and adjustments are crucial for maintaining optimal cluster performance and cost efficiency.
Next Steps
If you're looking to implement similar optimizations, follow this sequence:
1. Conduct a thorough cluster audit
2. Implement monitoring before making changes
3. Make incremental changes
4. Document everything
5. Set up regular review cycles
Need help implementing these optimizations? Feel free to reach out to me on LinkedIn or comment below with your questions.