
feat(hosting): add Helm chart for Agenta OSS Kubernetes deployment #3852

Open
endoze wants to merge 5 commits into Agenta-AI:main from endoze:feat-add-helm-chart-for-agenta-oss

@endoze endoze commented Feb 26, 2026

Enable self-hosted Kubernetes deployments as an alternative to docker-compose. The chart packages all Agenta OSS components (API, web, services, workers, cron, Redis, SuperTokens, PostgreSQL) with Bitnami PostgreSQL as a subchart dependency, Alembic migrations as a pre-install/pre-upgrade hook, and an optional ingress resource. Includes a CI workflow to publish the chart to GHCR on changes.



@dosubot dosubot bot added the size:XXL This PR changes 1000+ lines, ignoring generated files. label Feb 26, 2026

CLAassistant commented Feb 26, 2026

CLA assistant check
All committers have signed the CLA.


vercel bot commented Feb 26, 2026

@endoze is attempting to deploy a commit to the agenta projects Team on Vercel.

A member of the Team first needs to authorize it.

devin-ai-integration[bot]

This comment was marked as resolved.

Member

@mmabrouk mmabrouk left a comment


Thank you for putting this together. This is a solid Helm chart that correctly models our docker-compose architecture. The dual Redis setup, the existingSecret pattern, and the external database support are all done well. I appreciate the comprehensive documentation updates too.

I reviewed the chart against Helm community best practices and tested it locally. Below are my findings.


What I Did

I compared this PR against our current docker-compose infrastructure and an older Helm chart attempt (PR #2775). I also ran the chart through multiple validation layers: helm lint, helm template with dry-run, and a full install in a Kind cluster.
The lint and template steps passed. The cluster install revealed two issues that block deployment.


Critical Issues

1. Helm hook ordering causes a deadlock
The Alembic job uses pre-install,pre-upgrade hooks. However, it depends on PostgreSQL, which is part of the main release. Helm runs hooks before the main release. This means Alembic waits for a PostgreSQL that does not exist yet.
The install times out after 10 minutes with the init container stuck waiting.

To fix this, the agent suggests the following:

Change the hook to post-install,post-upgrade in templates/alembic-job.yaml:

annotations:
  helm.sh/hook: post-install,post-upgrade
  helm.sh/hook-weight: "0"
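A fuller sketch of how the migration Job might look with the post-install hook. Only the two hook annotations come from the suggestion above; the Job name, image, and command are assumptions for illustration. The delete-policy annotation is optional and matches what Helm does by default when no policy is set.

```yaml
# templates/alembic-job.yaml (sketch; name, image, and command are hypothetical)
apiVersion: batch/v1
kind: Job
metadata:
  name: agenta-alembic                      # hypothetical name
  annotations:
    # Run after the main release resources (including PostgreSQL) are created.
    helm.sh/hook: post-install,post-upgrade
    helm.sh/hook-weight: "0"
    # Optional: remove the previous hook Job before creating a new one.
    helm.sh/hook-delete-policy: before-hook-creation
spec:
  backoffLimit: 3
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: alembic
          image: example.invalid/agenta-api:TAG     # placeholder image
          command: ["alembic", "upgrade", "head"]   # assumed migration command
```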

2. PostgreSQL image tag does not exist
The bundled Bitnami PostgreSQL subchart (v16.4.16) defaults to an image tag that has been removed from Docker Hub:

Failed to pull image "docker.io/bitnami/postgresql:17.4.0-debian-12-r4": not found
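One possible workaround is to pin the subchart image to a tag that is still published, or to bump the subchart version. The sketch below assumes the subchart is keyed as `postgresql` in the parent chart's values and uses the Bitnami chart's `image.*` values. The tag is a placeholder, not a verified value.

```yaml
# values.yaml override (sketch)
postgresql:
  image:
    registry: docker.io
    repository: bitnami/postgresql
    tag: "<tag-that-still-exists>"   # placeholder; check the registry for a published tag
```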

Best Practice Improvements

These are not blockers, but they would strengthen the chart.

Security contexts. The chart defaults to empty security contexts, which means containers run as root. The Helm community recommends setting secure defaults:

securityContext:
  runAsNonRoot: true
  allowPrivilegeEscalation: false
  capabilities:
    drop: [ALL]
  seccompProfile:
    type: RuntimeDefault

Image tags default to latest. This makes deployments unpredictable. Consider defaulting to .Chart.AppVersion instead:

{{ .Values.api.image.tag | default .Chart.AppVersion }}

No values.schema.json. A JSON Schema catches misconfiguration at install time rather than at runtime. This is especially helpful for required fields like secrets.agentaAuthKey.

PostgreSQL password sync. Users must set both secrets.postgresPassword and postgresql.auth.password to the same value. If they mismatch, the app cannot connect. Consider wiring the subchart to use the chart-managed secret via postgresql.auth.existingSecret.
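A minimal sketch of that wiring, assuming the chart-managed Secret is named `agenta-secrets` (a hypothetical name) and stores the passwords under the key names shown. `auth.existingSecret` and `auth.secretKeys.*` are values of the Bitnami PostgreSQL chart.

```yaml
# values.yaml (sketch; Secret name and key names are assumptions)
postgresql:
  auth:
    # Secret created by the parent chart, so the password is set in one place.
    existingSecret: agenta-secrets
    secretKeys:
      adminPasswordKey: postgres-password
      userPasswordKey: password
```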

No startup probes. The deployments have liveness and readiness probes, but no startup probes. If the API takes longer than 30 seconds to start, Kubernetes will kill it. Startup probes give slow-starting containers more time.
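A startup probe for the API container could look roughly like this. The path and port are hypothetical; the timing values are illustrative rather than measured.

```yaml
# Container spec fragment (sketch)
startupProbe:
  httpGet:
    path: /health        # hypothetical endpoint
    port: 8000           # hypothetical port
  periodSeconds: 5
  failureThreshold: 30   # allows up to 5s x 30 = 150s for startup
```

Kubernetes holds off on liveness and readiness checks until the startup probe succeeds, so the slow start no longer triggers a restart.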

Empty resource defaults. All components default to resources: {}. This means pods get "BestEffort" QoS class and are first to be evicted under memory pressure. Consider adding suggested defaults or a production values example.
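As an illustration, a production values example might set requests and a memory limit per component. The `api.resources` key layout is assumed, and the numbers are placeholders, not measured recommendations.

```yaml
# values-production.yaml (sketch; numbers are placeholders)
api:
  resources:
    requests:
      cpu: 250m
      memory: 512Mi
    limits:
      memory: 1Gi
```

With requests set and limits not equal to requests, the pod is classed as Burstable rather than BestEffort.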

Missing .helmignore. Without this file, the packaged chart may include unnecessary files.

No lint step in CI. The GitHub Actions workflow packages and pushes to GHCR, but it does not run helm lint or ct lint. Adding these steps would catch issues before publishing.
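A lint job along these lines could run before the publish job. The chart path `hosting/helm/agenta` is a hypothetical location; adjust it to wherever the chart lives in the repository.

```yaml
# .github/workflows fragment (sketch; chart path is hypothetical)
jobs:
  lint:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: azure/setup-helm@v4
      - name: Update chart dependencies
        run: helm dependency update hosting/helm/agenta
      - name: Lint chart
        run: helm lint hosting/helm/agenta --strict
```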

@endoze endoze force-pushed the feat-add-helm-chart-for-agenta-oss branch from 4a1e166 to e499e94 Compare February 27, 2026 15:53
Author

endoze commented Feb 27, 2026

Thank you very much for such quick feedback on my contribution!

I've updated my commit to address your feedback. One major thing to note: I swapped the Bitnami Postgres chart to a newer version, which will deploy a newer version of Postgres as well. My cursory look through the codebase led me to think this is a safe upgrade, but I'm curious about your thoughts on this. As for why I upgraded it: Bitnami only keeps so many old tags around before cleaning things up, so I chose a much newer version to prolong its viability. I can adjust this as needed, however, to use the new chart version and default to a specific version of Postgres as necessary to meet the project's database needs.

Let me know if you find any other issues and I'll do my best to address them.

devin-ai-integration[bot]

This comment was marked as resolved.

@endoze endoze force-pushed the feat-add-helm-chart-for-agenta-oss branch from e499e94 to ee80414 Compare February 28, 2026 03:12
devin-ai-integration[bot]

This comment was marked as resolved.

@endoze endoze force-pushed the feat-add-helm-chart-for-agenta-oss branch from ee80414 to 9cf74d2 Compare February 28, 2026 16:09
Member

mmabrouk commented Mar 1, 2026

Follow-up: Full Cluster Testing Results

I deployed the chart on a k3s cluster (v1.33) and tested end-to-end in a browser. Thank you for addressing all the points from my first review -- the post-install hook, PostgreSQL upgrade, security contexts, startup probes, values schema, lint CI step, and shared PostgreSQL secret all look good.

The chart works, but I found three bugs and one documentation gap during testing. Details below.


Bugs Found

1. appVersion is missing the v prefix (image pull fails)

Chart.yaml has appVersion: "0.86.8", but the GHCR images are tagged v0.86.8 (with the v). A default install without explicit image tag overrides will fail with ImagePullBackOff because the tag 0.86.8 does not exist.

Fix: change appVersion: "0.86.8" to appVersion: "v0.86.8" in Chart.yaml.
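In file form, the fix and an alternative per-install override look like this. The `api.image.tag` key follows the template expression quoted in the first review; other components would presumably have equivalent keys.

```yaml
# Chart.yaml (fragment)
appVersion: "v0.86.8"   # matches the v-prefixed GHCR image tags
---
# values override (alternative; pins the tag explicitly for one component)
api:
  image:
    tag: "v0.86.8"
```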

2. Web container is unreachable (Next.js binding)

Next.js 15 defaults to binding on the pod hostname, not 0.0.0.0. Health probes and ingress traffic connect via localhost or the pod IP, so they cannot reach the web server. All readiness/liveness probes fail and the pod enters CrashLoopBackOff.

Fix: add HOSTNAME=0.0.0.0 to the web deployment env vars in templates/web-deployment.yaml:

env:
  - name: HOSTNAME
    value: "0.0.0.0"

3. runAsNonRoot: true crashes all pods

The security contexts set runAsNonRoot: true, but our Docker images currently run as root (USER is not set in the Dockerfiles). Every pod fails immediately with a security context violation.

Short-term fix: change the default to runAsNonRoot: false in values.yaml for all components.

Long-term: we need to update our Dockerfiles to run as non-root (tracked in #3868). Once that ships, the chart can flip back to true.
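A sketch of the short-term default for one component, assuming the values are keyed as `api.securityContext` (hypothetical layout) and that the images tolerate the other hardening settings from the first review.

```yaml
# values.yaml (sketch; key layout is an assumption)
api:
  securityContext:
    runAsNonRoot: false          # temporary: images currently run as root
    allowPrivilegeEscalation: false
    capabilities:
      drop: [ALL]
    seccompProfile:
      type: RuntimeDefault
```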


Nginx Ingress: Paths Need Regex Capture Groups

The ingress template uses plain paths (/api, /services, /) with pathType: Prefix. This works with Traefik's StripPrefix middleware, but not with nginx's rewrite-target annotation.

For nginx, the rewrite-target: /$1 annotation requires regex capture groups in the paths. Without them, $1 is empty and everything rewrites to /, causing a redirect loop on the web frontend.

The docs correctly tell nginx users to set rewrite-target and use-regex annotations, but the chart's hardcoded paths won't work with those annotations. Users would need to manually patch the ingress paths to:

/api/(.*)
/services/(.*)
/(.*)

with pathType: ImplementationSpecific.

This is tricky to fix in the template since Traefik needs Prefix paths and nginx needs ImplementationSpecific regex paths. One option: add an ingress.pathOverrides value, or detect the className and switch path styles. Or just document it clearly for now and fix in a follow-up.


Testing Summary

| Test | Result |
| --- | --- |
| helm lint | Pass |
| helm template --dry-run | Pass |
| Cluster install (all 11 pods) | Pass (with the three fixes above) |
| helm test | Pass |
| Migration job (Alembic) | Completed successfully |
| Web UI in browser | Works (login, navigation) |
| API health | {"status":"ok"} |
| Services health | 200 OK |

Cluster: k3s v1.33 on Hetzner, nginx ingress controller, all images pulled with tag: latest.


I pushed a commit with all three fixes plus documentation improvements to your branch.

Member

mmabrouk commented Mar 1, 2026

Follow-up: Configurable ingress paths for NGINX support

Pushed a second commit (1c80b77) that makes ingress paths configurable via values.yaml.

Problem

The ingress template hardcoded Prefix paths (/api, /services, /). This works with Traefik's StripPrefix middleware, but NGINX Ingress Controller needs regex capture groups in the paths for rewrite-target to work. Without them, $1 is empty and the web frontend gets stuck in a redirect loop.

Fix

Added ingress.paths.{api,services,web} to values, each with path and pathType. Defaults are unchanged (Prefix), so Traefik setups are not affected.

NGINX users override like this:

ingress:
  className: "nginx"
  annotations:
    nginx.ingress.kubernetes.io/use-regex: "true"
    nginx.ingress.kubernetes.io/rewrite-target: /$1
  paths:
    api:
      path: /api/(.*)
      pathType: ImplementationSpecific
    services:
      path: /services/(.*)
      pathType: ImplementationSpecific
    web:
      path: /(.*)
      pathType: ImplementationSpecific
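For comparison, the unchanged defaults that Traefik setups keep using would correspond to something like the following, reconstructed from the description above rather than copied from the chart.

```yaml
# values.yaml defaults (sketch based on the description above)
ingress:
  paths:
    api:
      path: /api
      pathType: Prefix
    services:
      path: /services
      pathType: Prefix
    web:
      path: /
      pathType: Prefix
```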

Verified

Upgraded the chart on the test cluster (k3s + NGINX Ingress Controller) with the new path overrides. All routes work:

  • Web: 200 (follows redirect from / to /w)
  • API: {"status":"ok"}
  • Services: 200

Also updated values.schema.json and the Kubernetes deployment docs with the new fields and a complete NGINX example.

devin-ai-integration[bot]

This comment was marked as resolved.

Author

endoze commented Mar 2, 2026

@mmabrouk Do you want me to address Devin-ai's latest comment (which does indeed seem to be a logical hole) or did you want to? Happy to do so but I don't want to step on anyone's toes 😄

Member

mmabrouk commented Mar 2, 2026

@endoze I'd be thankful if you did :)

endoze and others added 4 commits March 2, 2026 21:39
Enable self-hosted Kubernetes deployments as an alternative to
docker-compose. The chart packages all Agenta OSS components (API, web,
services, workers, cron, Redis, SuperTokens, PostgreSQL) with Bitnami
PostgreSQL as a subchart dependency, Alembic migrations as a
pre-install/pre-upgrade hook, and an optional ingress resource. Includes
a CI workflow to publish the chart to GHCR on changes.

- Fix appVersion to use v-prefixed tag (v0.86.8) matching GHCR images
- Add HOSTNAME=0.0.0.0 to web deployment so Next.js binds to all interfaces
- Change runAsNonRoot default to false (images currently run as root)
- Document PostgreSQL secret name dependency on release name
- Document ingress className default (traefik) with override instructions

The ingress template previously hardcoded Prefix paths which only work
with Traefik. NGINX Ingress Controller requires regex capture groups
and ImplementationSpecific pathType for rewrite-target to work.

Add ingress.paths.{api,services,web} to values.yaml so users can
override path patterns and pathType per backend. Defaults remain
Prefix (backward compatible with Traefik). Update docs with the
full nginx configuration including path overrides.

When users provide a pre-created Kubernetes Secret via
secrets.existingSecret, the Bitnami PostgreSQL subchart silently
fails to find the password unless global.postgresql.auth.existingSecret
is also pointed at the same secret. This adds a fail-fast validation
template and clearer NOTES.txt guidance so users get an actionable
error at install time instead of a broken deployment.
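In values terms, the configuration that this validation checks for would look roughly like the following. The Secret name is hypothetical; the two value paths are the ones named in the commit message.

```yaml
# values.yaml (sketch; Secret name is hypothetical)
secrets:
  existingSecret: my-agenta-secret
global:
  postgresql:
    auth:
      existingSecret: my-agenta-secret   # must reference the same Secret
```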
@endoze endoze force-pushed the feat-add-helm-chart-for-agenta-oss branch from 3f1dd6e to 26291b7 Compare March 3, 2026 02:40
Author

endoze commented Mar 3, 2026

@mmabrouk I've rebased the branch off the latest from main as well as addressed the last bit of feedback from Devin-ai's review. Let me know if you need anything else on this branch.


vercel bot commented Mar 3, 2026

The latest updates on your projects.

| Project | Deployment | Actions | Updated (UTC) |
| --- | --- | --- | --- |
| agenta-documentation | Ready | Preview, Comment | Mar 3, 2026 11:55am |


Member

mmabrouk commented Mar 3, 2026

@all-contributors please add @endoze for infrastructure and docs and infra

@allcontributors
Contributor

@mmabrouk

I've put up a pull request to add @endoze! 🎉

Member

@mmabrouk mmabrouk left a comment


Many thanks @endoze this looks great!

@jp-agenta lgtm from my side, I did a final test locally on k3s and it all worked fine.

@dosubot dosubot bot added the lgtm This PR has been approved by a maintainer label Mar 3, 2026
Member

mmabrouk commented Mar 3, 2026

Hey @endoze, feel free to share your LinkedIn or Twitter if you would like to be mentioned in a post when we merge this.

Author

endoze commented Mar 4, 2026

@mmabrouk Just my GitHub if you'd like. I also sent over a pull request to handle running the containers as non-root as a complement to this one. #3899 should enable the ability to harden the defaults in the Helm chart.


Labels

ci/cd feature lgtm This PR has been approved by a maintainer size:XXL This PR changes 1000+ lines, ignoring generated files.


3 participants