feat(hosting): add Helm chart for Agenta OSS Kubernetes deployment #3852
endoze wants to merge 5 commits into Agenta-AI:main
@endoze is attempting to deploy a commit to the agenta projects Team on Vercel. A member of the Team first needs to authorize it.
mmabrouk
left a comment
Thank you for putting this together. This is a solid Helm chart that correctly models our docker-compose architecture. The dual Redis setup, the existingSecret pattern, and the external database support are all done well. I appreciate the comprehensive documentation updates too.
I reviewed the chart against Helm community best practices and tested it locally. Below are my findings.
What I Did
I compared this PR against our current docker-compose infrastructure and an older Helm chart attempt (PR #2775). I also ran the chart through multiple validation layers: helm lint, helm template with dry-run, and a full install in a Kind cluster.
The lint and template steps passed. The cluster install revealed two issues that block deployment.
Critical Issues
1. Helm hook ordering causes a deadlock
The Alembic job uses pre-install,pre-upgrade hooks. However, it depends on PostgreSQL, which is part of the main release. Helm runs hooks before the main release. This means Alembic waits for a PostgreSQL that does not exist yet.
The install times out after 10 minutes with the init container stuck waiting.
To fix this, the agent suggests the following:
Change the hook to post-install,post-upgrade in templates/alembic-job.yaml:
annotations:
  helm.sh/hook: post-install,post-upgrade
  helm.sh/hook-weight: "0"

2. PostgreSQL image tag does not exist
The bundled Bitnami PostgreSQL subchart (v16.4.16) defaults to an image tag that has been removed from Docker Hub:
Failed to pull image "docker.io/bitnami/postgresql:17.4.0-debian-12-r4": not found
Best Practice Improvements
These are not blockers, but they would strengthen the chart.
Security contexts. The chart defaults to empty security contexts, which means containers run as root. The Helm community recommends setting secure defaults:
securityContext:
  runAsNonRoot: true
  allowPrivilegeEscalation: false
  capabilities:
    drop: [ALL]
  seccompProfile:
    type: RuntimeDefault

Image tags default to latest. This makes deployments unpredictable. Consider defaulting to .Chart.AppVersion instead:
{{ .Values.api.image.tag | default .Chart.AppVersion }}
No values.schema.json. A JSON Schema catches misconfiguration at install time rather than at runtime. This is especially helpful for required fields like secrets.agentaAuthKey.
PostgreSQL password sync. Users must set both secrets.postgresPassword and postgresql.auth.password to the same value. If they mismatch, the app cannot connect. Consider wiring the subchart to use the chart-managed secret via postgresql.auth.existingSecret.
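A minimal sketch of that wiring, assuming the chart-managed Secret is named agenta-secrets (a hypothetical name) and stores the passwords under the keys shown. The auth.existingSecret and auth.secretKeys values belong to the Bitnami PostgreSQL chart; the key names should be checked against the subchart version in use.

```yaml
# values.yaml (sketch): point the Bitnami subchart at the chart-managed Secret
postgresql:
  auth:
    # Hypothetical Secret name; use whatever the parent chart actually renders.
    existingSecret: agenta-secrets
    secretKeys:
      # Keys inside the Secret that hold each password (assumed names).
      adminPasswordKey: postgres-password
      userPasswordKey: password
```

With this in place the password is defined once, so the application and the database cannot drift apart.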
No startup probes. The deployments have liveness and readiness probes, but no startup probes. If the API takes longer than 30 seconds to start, Kubernetes will kill it. Startup probes give slow-starting containers more time.
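As an illustration, a startup probe on the API container might look like the fragment below. The path and port are assumptions, not values taken from the chart.

```yaml
# Container spec fragment (sketch): allows up to 30 x 5s = 150s for startup
startupProbe:
  httpGet:
    path: /health        # hypothetical health endpoint
    port: 8000           # hypothetical container port
  periodSeconds: 5       # probe every 5 seconds
  failureThreshold: 30   # give up after 30 consecutive failures
```

Kubernetes holds off liveness and readiness checks until the startup probe succeeds, so the liveness settings can stay strict.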
Empty resource defaults. All components default to resources: {}. This means pods get "BestEffort" QoS class and are first to be evicted under memory pressure. Consider adding suggested defaults or a production values example.
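A production values example could start from something like this. The api.resources key is assumed to follow the same layout as api.image.tag, and the numbers are placeholders rather than measured requirements.

```yaml
# values-production.yaml (sketch): placeholder numbers, tune per workload
api:
  resources:
    requests:
      cpu: 250m        # placeholder
      memory: 512Mi    # placeholder
    limits:
      memory: 1Gi      # placeholder
```

Setting requests moves the pod out of the BestEffort class; with requests lower than limits it lands in Burstable.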
Missing .helmignore. Without this file, the packaged chart may include unnecessary files.
No lint step in CI. The GitHub Actions workflow packages and pushes to GHCR, but it does not run helm lint or ct lint. Adding these steps would catch issues before publishing.
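A lint step could be sketched as follows. The chart path is a placeholder, and the surrounding job definition is assumed to exist already in the workflow.

```yaml
# Workflow job steps (sketch): lint before packaging and pushing
steps:
  - uses: actions/checkout@v4
  - uses: azure/setup-helm@v4
  - name: Lint chart
    # Placeholder path; adjust to the chart's location in the repository.
    run: helm lint ./path/to/chart
```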
Force-pushed: 4a1e166 to e499e94
Thank you very much for such quick feedback on my contribution! I've updated my commit to address your feedback. One major thing to note: I swapped the Bitnami Postgres chart to a newer version, which will deploy a newer version of Postgres as well. My cursory look through the codebase led me to think this is a safe upgrade, but I'm curious about your thoughts on this. As for why I upgraded it: Bitnami only keeps old tags around for so long before cleaning them up, so I chose a much newer version to prolong its viability. If needed, I can keep the new chart version and default to a specific version of Postgres that meets the project's database needs. Let me know if you find any other issues and I'll do my best to address them.
Force-pushed: e499e94 to ee80414
Force-pushed: ee80414 to 9cf74d2
Follow-up: Full Cluster Testing Results
I deployed the chart on a k3s cluster (v1.33) and tested end-to-end in a browser. Thank you for addressing all the points from my first review: the post-install hook, PostgreSQL upgrade, security contexts, startup probes, values schema, lint CI step, and shared PostgreSQL secret all look good. The chart works, but I found three bugs and one documentation gap during testing. Details below.
Bugs Found
1.
Fix: change
2. Web container is unreachable (Next.js binding)
Next.js 15 defaults to binding on the pod hostname, not
Fix: add
env:
  - name: HOSTNAME
    value: "0.0.0.0"
3. The security contexts set
Short-term fix: change the default to
Long-term: we need to update our Dockerfiles to run as non-root (tracked in #3868). Once that ships, the chart can flip back to
Nginx Ingress: Paths Need Regex Capture Groups
The ingress template uses plain paths ( For nginx, the The docs correctly tell nginx users to set with This is tricky to fix in the template since Traefik needs
Testing Summary
Cluster: k3s v1.33 on Hetzner, nginx ingress controller, all images pulled with
I pushed a commit with all three fixes plus documentation improvements to your branch.
Follow-up: Configurable ingress paths for NGINX support
Pushed a second commit (
Problem
The ingress template hardcoded
Fix
Added
NGINX users override like this:
ingress:
  className: "nginx"
  annotations:
    nginx.ingress.kubernetes.io/use-regex: "true"
    nginx.ingress.kubernetes.io/rewrite-target: /$1
  paths:
    api:
      path: /api/(.*)
      pathType: ImplementationSpecific
    services:
      path: /services/(.*)
      pathType: ImplementationSpecific
    web:
      path: /(.*)
      pathType: ImplementationSpecific
Verified
Upgraded the chart on the test cluster (k3s + NGINX Ingress Controller) with the new path overrides. All routes work:
Also updated
@mmabrouk Do you want me to address Devin-ai's latest comment (which does indeed seem to be a logical hole) or did you want to? Happy to do so but I don't want to step on anyone's toes 😄
@endoze I'd be thankful if you did :)
Enable self-hosted Kubernetes deployments as an alternative to docker-compose. The chart packages all Agenta OSS components (API, web, services, workers, cron, Redis, SuperTokens, PostgreSQL) with Bitnami PostgreSQL as a subchart dependency, Alembic migrations as a pre-install/pre-upgrade hook, and an optional ingress resource. Includes a CI workflow to publish the chart to GHCR on changes.
- Fix appVersion to use v-prefixed tag (v0.86.8) matching GHCR images
- Add HOSTNAME=0.0.0.0 to web deployment so Next.js binds to all interfaces
- Change runAsNonRoot default to false (images currently run as root)
- Document PostgreSQL secret name dependency on release name
- Document ingress className default (traefik) with override instructions
The ingress template previously hardcoded Prefix paths which only work
with Traefik. NGINX Ingress Controller requires regex capture groups
and ImplementationSpecific pathType for rewrite-target to work.
Add ingress.paths.{api,services,web} to values.yaml so users can
override path patterns and pathType per backend. Defaults remain
Prefix (backward compatible with Traefik). Update docs with the
full nginx configuration including path overrides.
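For orientation, the defaults described in this commit message might look like the sketch below. The traefik class name and the Prefix path type come from the commit messages; the concrete path strings are assumptions inferred from the NGINX override example.

```yaml
# values.yaml defaults (sketch): path strings are assumed, not confirmed
ingress:
  className: "traefik"
  paths:
    api:
      path: /api
      pathType: Prefix
    services:
      path: /services
      pathType: Prefix
    web:
      path: /
      pathType: Prefix
```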
When users provide a pre-created Kubernetes Secret via secrets.existingSecret, the Bitnami PostgreSQL subchart silently fails to find the password unless global.postgresql.auth.existingSecret is also pointed at the same secret. This adds a fail-fast validation template and clearer NOTES.txt guidance so users get an actionable error at install time instead of a broken deployment.
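A fail-fast check of this kind can be sketched with Helm's built-in fail function. This is not the PR's actual template: the file name is hypothetical, and the sketch assumes values.yaml declares global.postgresql.auth.existingSecret with an empty default so the lookup never hits a missing key. A real check would probably also skip the test when the bundled PostgreSQL is disabled.

```yaml
{{- /* templates/validate.yaml (hypothetical file name) */ -}}
{{- if .Values.secrets.existingSecret }}
{{- if not .Values.global.postgresql.auth.existingSecret }}
{{- fail "secrets.existingSecret is set: also set global.postgresql.auth.existingSecret to the same Secret" }}
{{- end }}
{{- end }}
```

Because the check runs at render time, helm install and helm template both stop with this message before any resource is created.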
Force-pushed: 3f1dd6e to 26291b7
@mmabrouk I've rebased the branch off the latest from main as well as addressed the last bit of feedback from Devin-ai's review. Let me know if you need anything else on this branch.
@all-contributors please add @endoze for infrastructure and docs and infra
I've put up a pull request to add @endoze! 🎉
mmabrouk
left a comment
Many thanks @endoze this looks great!
@jp-agenta LGTM from my side. I did a final test locally on k3s and it all worked fine.
Hey @endoze, feel free to share your LinkedIn or Twitter if you would like to be mentioned in a post when we merge this.