Commit 1d22666

Noah Boyers and claude committed
feat: major infrastructure optimization and security improvements
- Migrate RDS to Aurora Serverless v2 (Coder & LiteLLM) with auto-scaling
- Add VPC endpoints (S3, ECR) to reduce NAT Gateway costs
- Optimize EKS with Graviton ARM instances and reduced storage (50GB→20GB)
- Reduce Karpenter node volumes (1400Gi→500Gi) for cost efficiency
- Add AWS Secrets Manager for secure credential management
- Configure SSL termination at NLB with proper redirect handling
- Add Karpenter feature gates for spot consolidation
- Update workflows and pre-commit config formatting
- Add cost optimization strategy documentation

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
1 parent c678a34 commit 1d22666

16 files changed: +492 -115 lines changed

.github/workflows/pre-commit-hooks.yml

Lines changed: 3 additions & 3 deletions
@@ -6,8 +6,8 @@ name: Pre-commit Validation
 on:
   pull_request:
     paths:
-      - '.pre-commit-config.yaml'
-      - '.github/workflows/pre-commit-hooks.yml'
+      - ".pre-commit-config.yaml"
+      - ".github/workflows/pre-commit-hooks.yml"
 
 jobs:
   validate-pre-commit:
@@ -19,7 +19,7 @@ jobs:
       - name: Set up Python
         uses: actions/setup-python@v4
         with:
-          python-version: '3.11'
+          python-version: "3.11"
 
       - name: Install pre-commit
         run: |

.github/workflows/secret-scanning.yml

Lines changed: 3 additions & 3 deletions
@@ -7,8 +7,8 @@ on:
   push:
     branches:
       - main
-      - 'feature/**'
-      - 'fix/**'
+      - "feature/**"
+      - "fix/**"
 
 permissions:
   contents: write
@@ -23,7 +23,7 @@ jobs:
       - name: Checkout code
         uses: actions/checkout@v4
         with:
-          fetch-depth: 0 # Fetch all history for accurate scanning
+          fetch-depth: 0 # Fetch all history for accurate scanning
 
       - name: Run Gitleaks
         uses: gitleaks/gitleaks-action@v2

.github/workflows/terraform-apply.yml

Lines changed: 5 additions & 5 deletions
@@ -5,13 +5,13 @@ on:
     branches:
       - main
     paths:
-      - 'infra/aws/**/*.tf'
-      - 'infra/aws/**/*.tfvars'
-      - '.github/workflows/terraform-*.yml'
+      - "infra/aws/**/*.tf"
+      - "infra/aws/**/*.tfvars"
+      - ".github/workflows/terraform-*.yml"
   workflow_dispatch:
     inputs:
       module:
-        description: 'Specific module to apply (leave empty for all changed)'
+        description: "Specific module to apply (leave empty for all changed)"
         required: false
         type: string
 
@@ -65,7 +65,7 @@ jobs:
       matrix:
         module: ${{ fromJson(needs.detect-changes.outputs.modules) }}
       fail-fast: false
-      max-parallel: 1 # Apply modules one at a time to avoid conflicts
+      max-parallel: 1 # Apply modules one at a time to avoid conflicts
     defaults:
       run:
         working-directory: ${{ matrix.module }}

.github/workflows/terraform-destroy.yml

Lines changed: 1 addition & 1 deletion
@@ -4,7 +4,7 @@ on:
   workflow_dispatch:
     inputs:
       module:
-        description: 'Module to destroy (e.g., infra/aws/us-east-2/eks)'
+        description: "Module to destroy (e.g., infra/aws/us-east-2/eks)"
         required: true
         type: string
       confirm:

.github/workflows/terraform-plan.yml

Lines changed: 3 additions & 3 deletions
@@ -5,9 +5,9 @@ on:
     branches:
       - main
     paths:
-      - 'infra/aws/**/*.tf'
-      - 'infra/aws/**/*.tfvars'
-      - '.github/workflows/terraform-*.yml'
+      - "infra/aws/**/*.tf"
+      - "infra/aws/**/*.tfvars"
+      - ".github/workflows/terraform-*.yml"
 
 permissions:
   contents: read

.pre-commit-config.yaml

Lines changed: 4 additions & 4 deletions
@@ -17,13 +17,13 @@ repos:
         exclude: '\.md$'
       - id: end-of-file-fixer
       - id: check-yaml
-        args: ['--unsafe'] # Allow custom YAML tags
+        args: ["--unsafe"] # Allow custom YAML tags
       - id: check-added-large-files
-        args: ['--maxkb=1000']
+        args: ["--maxkb=1000"]
       - id: check-merge-conflict
       - id: detect-private-key
       - id: detect-aws-credentials
-        args: ['--allow-missing-credentials']
+        args: ["--allow-missing-credentials"]
 
   # Terraform
   - repo: https://github.com/antonbabenko/pre-commit-terraform
@@ -47,7 +47,7 @@ repos:
     rev: v4.5.0
     hooks:
       - id: no-commit-to-branch
-        args: ['--branch', 'main', '--branch', 'master']
+        args: ["--branch", "main", "--branch", "master"]
         stages: [commit]
 
 # Global settings

docs/cost-optimization-strategy.md

Lines changed: 130 additions & 0 deletions
@@ -0,0 +1,130 @@
# Cost Optimization Strategy for Coder Demo

## Mixed Capacity Approach

### Node Group Strategy

**System Nodes (ON_DEMAND)**

- **Purpose**: Run critical Kubernetes infrastructure
- **Workloads**: CoreDNS, kube-proxy, metrics-server, cert-manager, AWS LB Controller
- **Size**: t4g.medium (ARM Graviton)
- **Count**: 1-2 nodes minimum
- **Cost**: ~$24/month (1 node) to $48/month (2 nodes)

**Application Nodes (MIXED: 20% On-Demand, 80% Spot via Karpenter)**

- **Purpose**: Run Coder server and workspaces
- **Spot Savings**: 70-90% cost reduction
- **Interruption Risk**: Mitigated by:
  - Multiple instance types (diversified Spot pools)
  - Karpenter auto-rebalancing
  - Pod Disruption Budgets

### Karpenter NodePool Configuration

#### 1. Coder Server NodePool (ON_DEMAND Priority)

```yaml
capacity_type: ["on-demand", "spot"] # Prefer On-Demand, fallback to Spot
weight:
  on-demand: 100 # Higher priority
  spot: 10
```

#### 2. Coder Workspace NodePool (SPOT Priority)

```yaml
capacity_type: ["spot", "on-demand"] # Prefer Spot, fallback to On-Demand
weight:
  spot: 100 # Higher priority
  on-demand: 10
```
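
The two `capacity_type`/`weight` snippets above read as shorthand rather than literal Karpenter manifests. As a hedged sketch, assuming the `karpenter.sh/v1` NodePool API (the Karpenter version used by this repository is not shown in the commit), a preference between capacity types is commonly modelled as one NodePool per capacity type: the capacity type goes into a `karpenter.sh/capacity-type` requirement, and `spec.weight` is a single integer, with higher-weight pools considered first. The pool names, the `default` EC2NodeClass, and the `node_pool` helper are hypothetical.

```python
import json


def node_pool(name: str, capacity_type: str, weight: int) -> dict:
    """Build a NodePool manifest as a dict (karpenter.sh/v1 is an assumption)."""
    return {
        "apiVersion": "karpenter.sh/v1",
        "kind": "NodePool",
        "metadata": {"name": name},
        "spec": {
            # One integer per pool; higher-weight pools are tried first.
            "weight": weight,
            "template": {
                "spec": {
                    "nodeClassRef": {
                        "group": "karpenter.k8s.aws",
                        "kind": "EC2NodeClass",
                        "name": "default",  # hypothetical node class name
                    },
                    "requirements": [
                        {
                            "key": "karpenter.sh/capacity-type",
                            "operator": "In",
                            "values": [capacity_type],
                        }
                    ],
                }
            },
        },
    }


# Coder server: prefer On-Demand, fall back to Spot.
coder_server_pools = [
    node_pool("coder-server-on-demand", "on-demand", 100),
    node_pool("coder-server-spot", "spot", 10),
]

# Coder workspaces: prefer Spot, fall back to On-Demand.
workspace_pools = [
    node_pool("coder-workspace-spot", "spot", 100),
    node_pool("coder-workspace-on-demand", "on-demand", 10),
]

if __name__ == "__main__":
    for pool in coder_server_pools + workspace_pools:
        print(json.dumps(pool, indent=2))
```

The script prints each manifest as JSON, which is also valid YAML. Workloads would still need a selector or taint to land on the intended pair of pools; that part is left out here.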

### Risk Mitigation

**Spot Interruption Handling:**

1. **2-minute warning** → Karpenter automatically provisions replacement
2. **Multiple instance types** → 15+ types reduces interruption rate to <1%
3. **Pod Disruption Budgets** → Ensures minimum replicas always running
4. **Karpenter Consolidation** → Automatically moves pods before termination

**Example Instance Type Diversity:**

```
Spot Pool: t4g.medium, t4g.large, t3a.medium, t3a.large,
           m6g.medium, m6g.large, m6a.medium, m6a.large
```

### Cost Breakdown

| Component          | Instance Type | Capacity  | Monthly Cost  |
| ------------------ | ------------- | --------- | ------------- |
| System Nodes (2)   | t4g.medium    | ON_DEMAND | $48           |
| Coder Server (2)   | t4g.large     | 80% SPOT  | $28 (vs $140) |
| Workspaces (avg 5) | t4g.xlarge    | 90% SPOT  | $75 (vs $750) |
| **Total**          |               | **Mixed** | **$151/mo**   |

**vs All On-Demand:** $938/month → **84% savings**
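
The totals in the table can be checked with a few lines of arithmetic. The sketch below only re-adds the monthly figures quoted in the table; it does not verify them against AWS pricing.

```python
# Monthly figures copied from the Cost Breakdown table (USD).
ROWS = [
    # (component, mixed-capacity cost, all-On-Demand cost)
    ("System Nodes (2)", 48, 48),
    ("Coder Server (2)", 28, 140),
    ("Workspaces (avg 5)", 75, 750),
]

mixed_total = sum(mixed for _, mixed, _ in ROWS)
on_demand_total = sum(on_demand for _, _, on_demand in ROWS)
savings = 1 - mixed_total / on_demand_total

print(f"mixed: ${mixed_total}/mo")          # mixed: $151/mo
print(f"on-demand: ${on_demand_total}/mo")  # on-demand: $938/mo
print(f"savings: {savings:.0%}")            # savings: 84%
```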

### Dynamic Scaling

**Low Usage (nights/weekends):**

- Scale to zero workspaces
- Keep 1 system node + 1 Coder server node
- Cost: ~$48/month during idle

**High Usage (business hours):**

- Auto-scale workspaces on Spot
- Karpenter provisions nodes in <60 seconds
- Cost: ~$150-200/month during peak
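
The idle and peak figures are both quoted as monthly rates, so the bill for a real month depends on how the hours split between the two modes. As a rough sketch, assuming 50 business hours per week (an illustrative number, not taken from the source), a time-weighted blend looks like this:

```python
HOURS_PER_WEEK = 7 * 24
BUSINESS_HOURS_PER_WEEK = 50  # assumption for illustration, not from the source

IDLE_RATE = 48                # USD/month if idle all month (from the text)
PEAK_RATE_RANGE = (150, 200)  # USD/month if at peak all month (from the text)


def blended_monthly_cost(peak_rate: float, peak_fraction: float) -> float:
    """Time-weighted average of the idle and peak monthly rates."""
    return IDLE_RATE * (1 - peak_fraction) + peak_rate * peak_fraction


peak_fraction = BUSINESS_HOURS_PER_WEEK / HOURS_PER_WEEK
low, high = (blended_monthly_cost(rate, peak_fraction) for rate in PEAK_RATE_RANGE)
print(f"estimated bill: ${low:.0f}-${high:.0f}/month")
```

Under that assumption the estimate comes out at roughly $78-93 per month; a different split of business hours shifts it linearly between the idle and peak rates.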

### Monitoring & Alerts

**CloudWatch Alarms:**

- Spot interruption rate > 5%
- Available On-Demand capacity < 20%
- Karpenter provisioning failures

**Response:**

- Automatic fallback to On-Demand
- Email alerts to ops team
- Karpenter adjusts instance type mix

## Implementation Timeline

1. ✅ Deploy EKS with ON_DEMAND system nodes
2. ⏳ Deploy Karpenter
3. ⏳ Configure mixed-capacity NodePools
4. ⏳ Deploy Coder with node affinity rules
5. ⏳ Test Spot interruption handling
6. ⏳ Enable auto-scaling policies

## Fallback Plan

If Spot becomes unreliable (rare):

1. Update Karpenter NodePool to 100% On-Demand
2. `kubectl apply -f nodepool-ondemand.yaml`
3. Karpenter gracefully migrates pods
4. Takes ~5 minutes, zero downtime

## Best Practices

✅ **DO:**

- Use multiple Spot instance types (10+)
- Set Pod Disruption Budgets
- Monitor Spot interruption rates
- Test failover regularly

❌ **DON'T:**

- Run databases on Spot (use RDS)
- Use Spot for single-replica critical services
- Rely on single instance type for Spot

infra/aws/us-east-2/README.md

Lines changed: 7 additions & 0 deletions
@@ -7,6 +7,7 @@ This directory uses remote S3 backend for state management, but **backend config
 ## Local Setup
 
 1. **Get backend configuration from teammate** or **retrieve from AWS**:
+
 ```bash
 # Get S3 bucket name (it contains the account ID)
 aws s3 ls | grep terraform-state
@@ -24,6 +25,7 @@ This directory uses remote S3 backend for state management, but **backend config
 ```
 
 Create `backend.tf`:
+
 ```hcl
 terraform {
   backend "s3" {
@@ -62,6 +64,7 @@ These are configured in: Repository Settings > Secrets and variables > Actions
 Instead of creating backend.tf, you can use a config file:
 
 1. Create `backend.conf` (gitignored):
+
 ```
 bucket = "YOUR-BUCKET-NAME"
 dynamodb_table = "YOUR-TABLE-NAME"
@@ -86,12 +89,14 @@ Instead of creating backend.tf, you can use a config file:
 This repository has automated secret scanning to prevent accidental exposure of credentials:
 
 ### GitHub Actions (Automated)
+
 - **Gitleaks** - Scans every PR and push for secrets
 - **TruffleHog** - Additional verification layer
 - **Custom Pattern Matching** - Catches common secret patterns
 - **Auto-Revert** - Automatically reverts commits to main with secrets
 
 ### Pre-commit Hooks (Local)
+
 Catch secrets before they reach GitHub:
 
 ```bash
@@ -106,6 +111,7 @@ pre-commit run --all-files
 ```
 
 ### What Gets Detected
+
 - AWS Access Keys (AKIA...)
 - API Keys and Tokens
 - Private Keys (RSA, SSH, etc.)
@@ -115,6 +121,7 @@ pre-commit run --all-files
 - High-entropy strings (likely secrets)
 
 ### If Secrets Are Detected
+
 1. **PR is blocked** - Cannot merge until secrets are removed
 2. **Automatic notification** - PR comment explains the issue
 3. **Required actions**:
