|
| 1 | +# Cost Optimization Strategy for Coder Demo |
| 2 | + |
| 3 | +## Mixed Capacity Approach |
| 4 | + |
| 5 | +### Node Group Strategy |
| 6 | + |
| 7 | +**System Nodes (ON_DEMAND)** |
| 8 | + |
| 9 | +- **Purpose**: Run critical Kubernetes infrastructure |
| 10 | +- **Workloads**: CoreDNS, kube-proxy, metrics-server, cert-manager, AWS LB Controller |
| 11 | +- **Size**: t4g.medium (ARM Graviton) |
| 12 | +- **Count**: 1-2 nodes minimum |
| 13 | +- **Cost**: ~$24/month (1 node) to $48/month (2 nodes) |
| 14 | + |
| 15 | +**Application Nodes (MIXED: 20% On-Demand, 80% Spot via Karpenter)** |
| 16 | + |
| 17 | +- **Purpose**: Run Coder server and workspaces |
| 18 | +- **Spot Savings**: 70-90% cost reduction |
| 19 | +- **Interruption Risk**: Mitigated by: |
| 20 | + - Multiple instance types (diversified Spot pools) |
| 21 | + - Karpenter interruption handling (automatic node replacement) |
| 22 | + - Pod Disruption Budgets |
| 23 | + |
| 24 | +### Karpenter NodePool Configuration |
| 25 | + |
| 26 | +#### 1. Coder Server NodePool (ON_DEMAND Priority) |
| 27 | + |
| 28 | +```yaml |
| 29 | +capacity_type: ["on-demand", "spot"] # Shorthand: prefer On-Demand, fall back to Spot |
| 30 | +weight: # Karpenter sets weight per NodePool, so use one NodePool per capacity type |
| 31 | + on-demand: 100 # Higher priority |
| 32 | + spot: 10 |
| 33 | +``` |
| 34 | +
|
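The shorthand above could be expressed as two weighted NodePools, since Karpenter's `weight` field applies to a whole NodePool and higher-weight pools are tried first. This is a minimal sketch, not the demo's actual manifests: it assumes the `karpenter.sh/v1` API, and the pool names and the `default` EC2NodeClass are hypothetical.

```yaml
# Sketch only: names and the referenced EC2NodeClass are assumptions.
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: coder-server-on-demand   # hypothetical name
spec:
  weight: 100                    # tried first
  template:
    spec:
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default            # assumed to exist already
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["on-demand"]
---
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: coder-server-spot        # hypothetical name
spec:
  weight: 10                     # used only when the pool above cannot satisfy the pod
  template:
    spec:
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["spot"]
```

Splitting the pools keeps the On-Demand preference explicit. A single NodePool that allows both capacity types would try Spot first.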
| 35 | +#### 2. Coder Workspace NodePool (SPOT Priority) |
| 36 | +
|
| 37 | +```yaml |
| 38 | +capacity_type: ["spot", "on-demand"] # Prefer Spot, fall back to On-Demand |
| 39 | +weight: # Shorthand; Karpenter already tries Spot first when both types are allowed |
| 40 | + spot: 100 # Higher priority |
| 41 | + on-demand: 10 |
| 42 | +``` |
| 43 | +
|
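For the workspace pool, one NodePool that allows both capacity types may be enough, because Karpenter prioritizes Spot when a NodePool permits Spot and On-Demand. The sketch below assumes the `karpenter.sh/v1` API. The pool name and the `default` EC2NodeClass are hypothetical, and the instance list is an ARM-only subset of the example pool so that a single-architecture image would still schedule.

```yaml
# Sketch only: name, node class, and instance list are illustrative assumptions.
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: coder-workspaces         # hypothetical name
spec:
  template:
    spec:
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default            # assumed to exist already
      requirements:
        # Both types allowed: Spot is tried first, On-Demand is the fallback.
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["spot", "on-demand"]
        # Several instance types spread requests across Spot pools.
        - key: node.kubernetes.io/instance-type
          operator: In
          values: ["t4g.medium", "t4g.large", "m6g.medium", "m6g.large"]
```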
| 44 | +### Risk Mitigation |
| 45 | +
|
| 46 | +**Spot Interruption Handling:** |
| 47 | +
|
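| 48 | +1. **2-minute warning** → Karpenter starts provisioning a replacement node as soon as the notice arrives
| 49 | +2. **Multiple instance types** → 15+ types spread capacity across Spot pools, targeting an interruption rate below 1%
| 50 | +3. **Pod Disruption Budgets** → Keep a minimum number of replicas running while nodes drain
| 51 | +4. **Karpenter node draining** → Cordons and drains the interrupted node so pods reschedule before termination (requires the interruption queue to be configured)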
| 52 | +
|
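A Pod Disruption Budget for the Coder server might look like the following. It is a sketch under assumptions: the `coder` namespace, the PDB name, and the `app.kubernetes.io/name: coder` label are not taken from this document and should be matched to the actual deployment.

```yaml
# Sketch only: namespace, name, and selector label are assumptions.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: coder-server-pdb         # hypothetical name
  namespace: coder               # assumed namespace
spec:
  minAvailable: 1                # keep at least one replica up during node drains
  selector:
    matchLabels:
      app.kubernetes.io/name: coder   # assumed pod label
```

With two Coder server replicas, `minAvailable: 1` lets Karpenter drain one node at a time. A PDB only limits voluntary evictions, so it cannot hold a Spot node past its termination deadline.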
| 53 | +**Example Instance Type Diversity:** |
| 54 | +
|
| 55 | +``` |
| 56 | +Spot Pool: t4g.medium, t4g.large, t3a.medium, t3a.large, |
| 57 | + m6g.medium, m6g.large, m6a.large, m6a.xlarge |
| 58 | +``` |
| 59 | +
|
| 60 | +### Cost Breakdown |
| 61 | +
|
| 62 | +| Component | Instance Type | Capacity | Monthly Cost | |
| 63 | +| ------------------ | ------------- | --------- | ------------- | |
| 64 | +| System Nodes (2) | t4g.medium | ON_DEMAND | $48 | |
| 65 | +| Coder Server (2) | t4g.large | 80% SPOT | $28 (vs $140) | |
| 66 | +| Workspaces (avg 5) | t4g.xlarge | 90% SPOT | $75 (vs $750) | |
| 67 | +| **Total** | | **Mixed** | **$151/mo** | |
| 68 | +
|
| 69 | +**vs All On-Demand:** $938/month ($48 + $140 + $750) → **84% savings**
| 70 | +
|
| 71 | +### Dynamic Scaling |
| 72 | +
|
| 73 | +**Low Usage (nights/weekends):** |
| 74 | +
|
| 75 | +- Scale to zero workspaces |
| 76 | +- Keep 1 system node + 1 Coder server node |
| 77 | +- Cost: ~$48/month run rate while idle
| 78 | +
|
| 79 | +**High Usage (business hours):** |
| 80 | +
|
| 81 | +- Auto-scale workspaces on Spot |
| 82 | +- Karpenter provisions nodes in <60 seconds |
| 83 | +- Cost: ~$150-200/month run rate at peak
| 84 | +
|
| 85 | +### Monitoring & Alerts |
| 86 | +
|
| 87 | +**CloudWatch Alarms:** |
| 88 | +
|
| 89 | +- Spot interruption rate > 5% |
| 90 | +- Available On-Demand capacity < 20% |
| 91 | +- Karpenter provisioning failures |
| 92 | +
|
| 93 | +**Response:** |
| 94 | +
|
| 95 | +- Automatic fallback to On-Demand |
| 96 | +- Email alerts to ops team |
| 97 | +- Karpenter adjusts instance type mix |
| 98 | +
|
| 99 | +## Implementation Timeline |
| 100 | +
|
| 101 | +1. ✅ Deploy EKS with ON_DEMAND system nodes |
| 102 | +2. ⏳ Deploy Karpenter |
| 103 | +3. ⏳ Configure mixed-capacity NodePools |
| 104 | +4. ⏳ Deploy Coder with node affinity rules |
| 105 | +5. ⏳ Test Spot interruption handling |
| 106 | +6. ⏳ Enable auto-scaling policies |
| 107 | +
|
| 108 | +## Fallback Plan |
| 109 | +
|
| 110 | +If Spot becomes unreliable (rare): |
| 111 | +
|
| 112 | +1. Update the Karpenter NodePool to allow only On-Demand capacity
| 113 | +2. Apply it: `kubectl apply -f nodepool-ondemand.yaml`
| 114 | +3. Karpenter replaces the Spot nodes and reschedules their pods
| 115 | +4. The migration takes ~5 minutes; replicated services see zero downtime |
| 116 | + |
| 117 | +## Best Practices |
| 118 | + |
| 119 | +✅ **DO:** |
| 120 | + |
| 121 | +- Use multiple Spot instance types (10+) |
| 122 | +- Set Pod Disruption Budgets |
| 123 | +- Monitor Spot interruption rates |
| 124 | +- Test failover regularly |
| 125 | + |
| 126 | +❌ **DON'T:** |
| 127 | + |
| 128 | +- Run databases on Spot (use RDS) |
| 129 | +- Use Spot for single-replica critical services |
| 130 | +- Rely on a single instance type for Spot |