
How We Reduced AWS EKS Costs by 65%: A Complete Implementation Guide

deepakvijayaraj


Introduction


Recently, I worked with a startup that was bleeding money on their EKS infrastructure. Their monthly AWS bill for EKS alone was hitting $15,000. Through careful analysis and implementation of several optimization strategies, we managed to cut their costs by 65% while improving overall cluster efficiency. Here's the complete breakdown of how we achieved this.


Initial Assessment


The Problem State
  • 50 nodes running in the EKS cluster

  • Average node utilization: 20%

  • Poor pod distribution across nodes

  • Over-provisioned persistent volumes

  • No auto-scaling strategy

  • All nodes running on on-demand pricing


Infrastructure Analysis Steps

1. Deployed metrics-server, which kubectl top depends on

2. Used kubectl top nodes to gather per-node CPU and memory utilization

3. Used kube-resource-report to visualize cluster resource allocation

4. Created baseline cost allocation using AWS Cost Explorer with Kubernetes tags


Solution Implementation


1. Implementing Karpenter for Intelligent Node Provisioning
# karpenter.yaml
apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
metadata:
  name: default
spec:
  requirements:
    - key: "node.kubernetes.io/instance-type"
      operator: In
      values: ["m5.large", "m5.xlarge", "m5.2xlarge"]
  limits:
    resources:
      cpu: "1000"
      memory: 1000Gi
  providerRef:
    name: default
  ttlSecondsUntilExpired: 2592000  # 30 days; nodes older than this are replaced

Implementation steps:

1. Install Karpenter using Helm:

helm repo add karpenter https://charts.karpenter.sh
helm repo update
helm install karpenter karpenter/karpenter --namespace karpenter \
--create-namespace --set serviceAccount.create=true \
--set serviceAccount.annotations."eks\.amazonaws\.com/role-arn"=${KARPENTER_IAM_ROLE_ARN}

2. Configure node templates

3. Set up scaling metrics

4. Implement fallback strategies
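
Step 2 is where the providerRef in the Provisioner above gets its target: it points at an AWSNodeTemplate named default, which is not shown in this post. A minimal sketch of what that template could look like follows. It assumes the same karpenter.sh/discovery tags and the ${CLUSTER_NAME} and ${INSTANCE_PROFILE} placeholders used in the Spot section; the file name is illustrative and this is not the original cluster's template.

```yaml
# default-node-template.yaml (illustrative; not the original cluster's file)
apiVersion: karpenter.k8s.aws/v1alpha1
kind: AWSNodeTemplate
metadata:
  name: default            # must match spec.providerRef.name in the Provisioner
spec:
  # Subnets and security groups are discovered by tag. Tagging them with
  # karpenter.sh/discovery=<cluster name> is an assumption of this sketch.
  subnetSelector:
    karpenter.sh/discovery: ${CLUSTER_NAME}
  securityGroupSelector:
    karpenter.sh/discovery: ${CLUSTER_NAME}
  instanceProfile: ${INSTANCE_PROFILE}
```

kubectl does not expand the ${...} placeholders, so substitute them before applying the file, for example with envsubst.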


2. Pod Topology Spread Constraints
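# deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: sample-app
spec:
  replicas: 3  # example value; not from the original setup
  selector:
    matchLabels:
      app: sample-app
  template:
    metadata:
      labels:
        app: sample-app  # must match the labelSelector below
    spec:
      topologySpreadConstraints:
      - maxSkew: 1
        topologyKey: topology.kubernetes.io/zone
        whenUnsatisfiable: DoNotSchedule
        labelSelector:
          matchLabels:
            app: sample-app
      containers:
      - name: sample-app
        image: nginx:1.25  # placeholder image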

Implementation steps:

1. Define topology keys

2. Set up spread constraints

3. Configure maxSkew values

4. Test pod distribution
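
The zone key above balances replicas across availability zones, but the initial assessment listed poor distribution across nodes. One way to address that, sketched here as an assumption rather than the configuration actually used, is a second constraint keyed on kubernetes.io/hostname. It uses ScheduleAnyway so that a tightly packed cluster does not leave pods pending.

```yaml
# Fragment of spec.template.spec in deployment.yaml (illustrative)
topologySpreadConstraints:
- maxSkew: 1
  topologyKey: topology.kubernetes.io/zone   # hard requirement: balance across zones
  whenUnsatisfiable: DoNotSchedule
  labelSelector:
    matchLabels:
      app: sample-app
- maxSkew: 1
  topologyKey: kubernetes.io/hostname        # soft preference: balance across nodes
  whenUnsatisfiable: ScheduleAnyway
  labelSelector:
    matchLabels:
      app: sample-app
```

For step 4, kubectl get pods -l app=sample-app -o wide lists the node each replica landed on.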


3. Spot Instance Integration
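# spot-config.yaml
apiVersion: karpenter.k8s.aws/v1alpha1
kind: AWSNodeTemplate
metadata:
  name: spot-template
spec:
  subnetSelector:
    karpenter.sh/discovery: ${CLUSTER_NAME}
  securityGroupSelector:
    karpenter.sh/discovery: ${CLUSTER_NAME}
  instanceProfile: ${INSTANCE_PROFILE}
---
apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
metadata:
  name: spot
spec:
  requirements:  # capacity type and instance types belong here, not in AWSNodeTemplate
    - key: "karpenter.sh/capacity-type"
      operator: In
      values: ["spot"]
    - key: "node.kubernetes.io/instance-type"
      operator: In
      values: ["m5.large", "m5.xlarge", "m5.2xlarge"]
  providerRef:
    name: spot-template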

Implementation steps:

1. Create spot instance configuration

2. Set up interruption handling

3. Configure instance diversity

4. Implement fallback mechanisms
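
For step 4, Karpenter's v1alpha5 API lets a single Provisioner allow both capacity types. When both spot and on-demand are permitted, Karpenter favors Spot and launches On-Demand when Spot capacity cannot be obtained. The sketch below assumes that approach. The Provisioner name and the wider instance list (step 3, instance diversity) are illustrative and not taken from the original cluster.

```yaml
# spot-fallback-provisioner.yaml (illustrative)
apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
metadata:
  name: spot-with-fallback   # hypothetical name
spec:
  requirements:
    # Allowing both values lets Karpenter fall back to On-Demand
    - key: karpenter.sh/capacity-type
      operator: In
      values: ["spot", "on-demand"]
    # More instance types give Karpenter more Spot pools to choose from
    - key: node.kubernetes.io/instance-type
      operator: In
      values: ["m5.large", "m5.xlarge", "m5.2xlarge", "m5a.large", "m5a.xlarge", "m5a.2xlarge"]
  providerRef:
    name: spot-template
```

Interruption handling (step 2) is configured separately and is not covered by this sketch.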


4. Persistent Volume Optimization
# storage-class.yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: gp3-storage
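provisioner: ebs.csi.aws.com  # EBS CSI driver; the iops and throughput parameters below are CSI driver parameters
parameters:
  type: gp3
  iops: "3000"        # gp3 baseline IOPS
  throughput: "125"   # MiB/s, gp3 baseline throughput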

Implementation steps:

1. Audit existing PV usage

2. Migrate to gp3 volumes

3. Implement dynamic provisioning

4. Set up volume snapshots
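
Step 4 can be expressed with the standard Kubernetes snapshot API. The sketch below assumes the EBS CSI driver and the external snapshot controller with its CRDs are installed. The class name, snapshot name, and PVC name are hypothetical.

```yaml
# volume-snapshot.yaml (illustrative)
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshotClass
metadata:
  name: ebs-snapshots            # hypothetical name
driver: ebs.csi.aws.com
deletionPolicy: Delete           # deleting the VolumeSnapshot also deletes the EBS snapshot
---
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshot
metadata:
  name: sample-app-data-snap     # hypothetical name
spec:
  volumeSnapshotClassName: ebs-snapshots
  source:
    persistentVolumeClaimName: sample-app-data   # hypothetical PVC
```

Use deletionPolicy: Retain instead if snapshots must outlive the Kubernetes objects that created them.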


Results and Monitoring


Key Metrics Achieved
  • Node count reduction: 50 → 18

  • Average node utilization: 20% → 75%

  • Monthly cost savings: $9,750 (65% of the original $15,000 bill)

  • Additional 15% savings from PV optimization


Monitoring Setup

1. Implemented Prometheus for metrics collection:

helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm install prometheus prometheus-community/kube-prometheus-stack

2. Created Grafana dashboards for:

  • Node utilization

  • Pod distribution

  • Cost allocation

  • Spot instance savings


Best Practices and Lessons Learned


1. Regular Cost Reviews

  • Set up weekly cost analysis meetings

  • Review Cost Explorer data

  • Track spot savings


2. Maintenance Procedures

  • Regular cluster version updates

  • Node rotation schedule

  • Backup verification


3. Alert Setup

  • Node utilization thresholds

  • Spot instance interruption

  • Cost anomalies
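
The node utilization alert can be written as a PrometheusRule, a custom resource that kube-prometheus-stack installs. This is a sketch under assumptions: the 30% threshold, the 30-minute windows, and the rule names are placeholders, and the release: prometheus label assumes the Helm release name used in the monitoring setup together with the chart's default rule selector.

```yaml
# node-utilization-alert.yaml (illustrative)
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: node-utilization        # hypothetical name
  labels:
    release: prometheus         # assumed to match the chart's default rule selector
spec:
  groups:
  - name: cost.rules
    rules:
    - alert: NodeUnderutilized
      # Per-node CPU utilization averaged over 30 minutes, from node-exporter metrics
      expr: (1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[30m]))) < 0.30
      for: 30m
      labels:
        severity: warning
      annotations:
        summary: "CPU utilization on {{ $labels.instance }} has stayed below 30%"
```

A low-utilization alert like this points at nodes that are candidates for removal, which is the cost signal this post cares about; a high-utilization alert would use the same expression with the comparison reversed.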


Conclusion


Through systematic implementation of these optimizations, we cut the monthly EKS bill by 65% while improving cluster efficiency. The key was not just implementing individual solutions, but ensuring they worked together cohesively.

Remember that cost optimization is an ongoing process, not a one-time task. Regular monitoring and adjustments are crucial for maintaining optimal cluster performance and cost efficiency.


Next Steps


If you're looking to implement similar optimizations, follow these steps:

1. Conduct a thorough cluster audit

2. Implement monitoring before making changes

3. Make incremental changes

4. Document everything

5. Set up regular review cycles


Need help implementing these optimizations? Feel free to reach out to me on LinkedIn or comment below with your questions.


1 Comment


dhimas pramudya
3 days ago

Hi Vijayaraj, thank you for sharing. I am interested in the Spot Instance setup and how to do the smart fallback to On-Demand. Do you have any resources or recommendations related to that implementation? Thank you.
