Introduction
Recently, I worked with a startup that was bleeding money on their EKS infrastructure. Their monthly AWS bill for EKS alone was hitting $15,000. Through careful analysis and implementation of several optimization strategies, we managed to cut their costs by 65% while improving overall cluster efficiency. Here's the complete breakdown of how we achieved this.
Initial Assessment
The Problem State
50 nodes running in the EKS cluster
Average node utilization: 20%
Poor pod distribution across nodes
Over-provisioned persistent volumes
No auto-scaling strategy
All nodes running on on-demand pricing
Infrastructure Analysis Steps
1. Used kubectl top nodes to gather utilization metrics
2. Implemented metrics-server for detailed resource tracking
3. Used kube-resource-report to visualize cluster resource allocation
4. Created baseline cost allocation using AWS Cost Explorer with Kubernetes tags
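To make step 1 concrete, here is a small sketch of how raw `kubectl top nodes` output can be summarized with awk; the node names and numbers below are made up for the example, chosen to mirror the ~20% utilization we found:

```shell
# Sample `kubectl top nodes` output saved to a file (values are illustrative):
cat <<'EOF' > /tmp/top-nodes.txt
NAME     CPU(cores)   CPU%   MEMORY(bytes)   MEMORY%
node-1   380m         19%    2900Mi          22%
node-2   410m         20%    3100Mi          24%
node-3   420m         21%    3000Mi          23%
EOF

# Average CPU% across nodes: skip the header row, strip the % sign from column 3
awk 'NR > 1 { gsub("%", "", $3); sum += $3; n++ } END { printf "avg CPU: %.1f%%\n", sum / n }' /tmp/top-nodes.txt
```

Running the same one-liner against live `kubectl top nodes` output gives a quick utilization baseline before any changes are made.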
Solution Implementation
1. Implementing Karpenter for Intelligent Node Provisioning
# karpenter.yaml
apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
metadata:
  name: default
spec:
  requirements:
    - key: "node.kubernetes.io/instance-type"
      operator: In
      values: ["m5.large", "m5.xlarge", "m5.2xlarge"]
  limits:
    resources:
      cpu: "1000"
      memory: 1000Gi
  providerRef:
    name: default
  ttlSecondsUntilExpired: 2592000
Implementation steps:
1. Install Karpenter using Helm:
helm repo add karpenter https://charts.karpenter.sh
helm repo update
helm install karpenter karpenter/karpenter --namespace karpenter \
--create-namespace --set serviceAccount.create=true \
--set serviceAccount.annotations."eks\.amazonaws\.com/role-arn"=${KARPENTER_IAM_ROLE_ARN}
2. Configure node templates
3. Set up scaling metrics
4. Implement fallback strategies
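For the scale-down side of steps 3–4, one option is to let Karpenter reclaim capacity on its own; `ttlSecondsAfterEmpty` is a real v1alpha5 field, but the 30-second value here is only an illustration, not the exact configuration we ran:

```yaml
apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
metadata:
  name: default
spec:
  # Deprovision nodes that have held no non-daemonset pods for 30s
  ttlSecondsAfterEmpty: 30
  # Recycle nodes after 30 days, matching ttlSecondsUntilExpired above
  ttlSecondsUntilExpired: 2592000
```

Shorter TTLs reclaim cost faster but churn nodes more aggressively, so tune the value against your workload's tolerance for rescheduling.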
2. Pod Topology Spread Constraints
# deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: sample-app
spec:
  template:
    spec:
      topologySpreadConstraints:
        - maxSkew: 1
          topologyKey: topology.kubernetes.io/zone
          whenUnsatisfiable: DoNotSchedule
          labelSelector:
            matchLabels:
              app: sample-app
Implementation steps:
1. Define topology keys
2. Set up spread constraints
3. Configure maxSkew values
4. Test pod distribution
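As a sanity check for step 4, the skew can be computed from a pod-to-zone listing; the placements below are hypothetical (in practice you would join `kubectl get pods -o wide` with each node's topology.kubernetes.io/zone label):

```shell
# Hypothetical pod -> zone placements:
cat <<'EOF' > /tmp/pod-zones.txt
sample-app-1 us-east-1a
sample-app-2 us-east-1b
sample-app-3 us-east-1a
sample-app-4 us-east-1c
EOF

# Skew = pods in the fullest zone minus pods in the emptiest zone;
# the scheduler keeps this at or below maxSkew (1 in the deployment above)
awk '{ c[$2]++ } END { for (z in c) { if (c[z] > max) max = c[z]; if (min == 0 || c[z] < min) min = c[z] } print "skew:", max - min }' /tmp/pod-zones.txt
```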
3. Spot Instance Integration
# spot-config.yaml
apiVersion: karpenter.k8s.aws/v1alpha1
kind: AWSNodeTemplate
metadata:
  name: spot-template
spec:
  subnetSelector:
    karpenter.sh/discovery: ${CLUSTER_NAME}
  securityGroupSelector:
    karpenter.sh/discovery: ${CLUSTER_NAME}
  instanceProfile: ${INSTANCE_PROFILE}
---
# Instance types and capacity type are Provisioner requirements, not
# AWSNodeTemplate fields. Allowing both "spot" and "on-demand" lets
# Karpenter prefer spot and fall back to on-demand when spot is unavailable.
apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
metadata:
  name: spot
spec:
  providerRef:
    name: spot-template
  requirements:
    - key: "karpenter.sh/capacity-type"
      operator: In
      values: ["spot", "on-demand"]
Implementation steps:
1. Create spot instance configuration
2. Set up interruption handling
3. Configure instance diversity
4. Implement fallback mechanisms
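On top of interruption handling (step 2), capping how many replicas can be drained at once softens the impact of spot reclaims; below is a minimal PodDisruptionBudget for the sample-app from earlier, with an illustrative threshold:

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: sample-app-pdb
spec:
  # Keep at least 2 replicas running while spot nodes are drained
  minAvailable: 2
  selector:
    matchLabels:
      app: sample-app
```

Pair this with instance-type diversity (step 3) so a single spot pool reclaim never takes out most of a workload at once.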
4. Persistent Volume Optimization
# storage-class.yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: gp3-storage
# gp3 requires the EBS CSI driver; the in-tree kubernetes.io/aws-ebs
# provisioner does not support it
provisioner: ebs.csi.aws.com
parameters:
  type: gp3
  iops: "3000"
  throughput: "125"
Implementation steps:
1. Audit existing PV usage
2. Migrate to gp3 volumes
3. Implement dynamic provisioning
4. Set up volume snapshots
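With the gp3 class in place, dynamic provisioning (step 3) is just a matter of claims referencing it; the claim below is a hypothetical example, with a size chosen for illustration:

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: app-data
spec:
  accessModes: ["ReadWriteOnce"]
  # References the gp3-storage StorageClass defined above
  storageClassName: gp3-storage
  resources:
    requests:
      storage: 20Gi
```

Right-sizing requests like this one during the audit (step 1) is where most of the PV savings came from.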
Results and Monitoring
Key Metrics Achieved
Node count reduction: 50 → 18
Average node utilization: 20% → 75%
Monthly cost savings: $9,750
Additional 15% savings from PV optimization
Monitoring Setup
1. Implemented Prometheus for metrics collection:
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm install prometheus prometheus-community/kube-prometheus-stack
2. Created Grafana dashboards for:
Node utilization
Pod distribution
Cost allocation
Spot instance savings
Best Practices and Lessons Learned
1. Regular Cost Reviews
Set up weekly cost analysis meetings
Review Cost Explorer data
Track spot savings
2. Maintenance Procedures
Regular cluster version updates
Node rotation schedule
Backup verification
3. Alert Setup
Node utilization thresholds
Spot instance interruption
Cost anomalies
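The utilization alert above can be expressed as a PrometheusRule, assuming the CRDs installed by the kube-prometheus-stack chart from the monitoring setup; the thresholds and durations here are illustrative, not the exact values we used:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: node-utilization
spec:
  groups:
    - name: cost
      rules:
        - alert: NodeCPUUnderutilized
          # Average CPU busy % per node, from node_exporter metrics
          expr: (1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[30m]))) * 100 < 30
          for: 2h
          labels:
            severity: info
          annotations:
            summary: "Node {{ $labels.instance }} CPU below 30% for 2h"
```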
Conclusion
Through systematic implementation of these optimizations, we achieved significant cost savings while improving cluster efficiency. The key was not just implementing individual solutions, but ensuring they worked together cohesively.
Remember that cost optimization is an ongoing process, not a one-time task. Regular monitoring and adjustments are crucial for maintaining optimal cluster performance and cost efficiency.
Next Steps
If you're looking to implement similar optimizations, follow this sequence:
1. Conduct a thorough cluster audit
2. Implement monitoring before making changes
3. Make incremental changes
4. Document everything
5. Set up regular review cycles
Need help implementing these optimizations? Feel free to reach out to me on LinkedIn or comment below with your questions.