Native client-go implementation (big rewrite)#129
Native client-go implementation (big rewrite)#129inteon wants to merge 2 commits intocert-manager:mainfrom
Conversation
4688445 to
69c45fe
Compare
34afd96 to
fbac5d7
Compare
There was a problem hiding this comment.
Pull request overview
This pull request represents a major architectural rewrite of the webhook certificate authority controller, migrating from controller-runtime to a native client-go implementation. The changes introduce significant improvements in certificate rotation handling, particularly for scenarios with very short certificate lifetimes and large numbers of controllers.
Changes:
- Complete migration from controller-runtime to native client-go with custom informer factories
- New pending/serving CA certificate rotation mechanism with promotion delay to ensure smooth transitions
- Tests now use Go's
testing/synctestpackage instead of Ginkgo/Gomega - Removed unused internal error package and simplified PKI utilities
Reviewed changes
Copilot reviewed 72 out of 77 changed files in this pull request and generated 4 comments.
Show a summary per file
| File | Description |
|---|---|
| test/combined_controller_test.go | New comprehensive test suite using synctest for authority controller behavior |
| test/util_test.go | Helper functions for testing CA secrets and ValidatingWebhookConfigurations |
| test/delayed_watch.go | Simulates delayed watch events for testing race conditions |
| pkg/authority/authority.go | Core authority implementation with native client-go and workqueue |
| pkg/authority/options.go | Configuration validation and defaults for authority setup |
| pkg/authority/injectable/validating_webhook.go | Refactored CA bundle injection using informers and listers |
| internal/pki/cert_parser.go | New certificate parser with caching to avoid redundant parsing |
| internal/certificate/tls.go | Certificate generation utilities |
| examples/webhook-controller/main.go | Example webhook server demonstrating library usage |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
|
/approve |
|
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: inteon The full list of commands accepted by this bot can be found here. The pull request process is described here DetailsNeeds approval from an approver in each of these files:
Approvers can indicate their approval by writing |
erikgb
left a comment
There was a problem hiding this comment.
Thanks, @inteon! This is a HUGE PR, and I have no chance to review it properly. Are you able to split it up? You seem to change "everything" without explaining the motivations for it. 😅 I intentionally didn't change the code copied from cert-manager (PKI) and trust-manager (CertPool) to allow us to eventually stop this copy-pasting across projects. And now you are rewriting most of it.....
I have added some initial comments/questions, but I will not be able to LGTM this with the current size of the PR.
| Namespace: "my-namespace", | ||
| Name: "my-webhook-ca", | ||
| }, | ||
| Duration: 1 * time.Hour, // Optional: Default 1h |
There was a problem hiding this comment.
I think a default of 1 hour is too short. 😅
There was a problem hiding this comment.
I believe that if we build it to renew frequently by default, this forces us to make renewals happen without downtime.
Atm. renewal does not happen often and can be a hard to debug cause for temporary downtime.
There was a problem hiding this comment.
Also, by using short-lived certs, we improve our security posture.
|
|
||
| ### CA Rotation Lifecycle | ||
|
|
||
| To ensure zero downtime, the controller uses a "Pending CA" strategy. This allows the new CA to be distributed to all clients (via the `ValidatingWebhookConfiguration` CA bundle) *before* it is used to sign the serving certificate. |
There was a problem hiding this comment.
Is the formulation "(via the ValidatingWebhookConfiguration CA bundle" fully correct? We also support injecting other resources with a CA bundle.
There was a problem hiding this comment.
I don't think we implemented other resources yet. I would updated the README once we add support for the other resources.
|
|
||
| - **Leader first**: The controller that issued the pending CA can perform injection and promotion directly, while other controllers have to wait between 3-6s first. | ||
| - **Jitter**: Non-issuing controllers wait with randomized jitter (between 3 and 6 seconds) to avoid thundering-herd updates. | ||
| - **Renewal window**: The renewal timestamp is randomized to avoid all controllers renewing the certs at the same time. |
There was a problem hiding this comment.
I don't know if we should reimplement the old, obsolete leader election mechanism using configs/secrets, and I'd prefer to use the standard leader election mechanism.
There was a problem hiding this comment.
| # CA bundle will be injected by the Authority controller; keep empty here. | ||
| caBundle: "" |
There was a problem hiding this comment.
| # CA bundle will be injected by the Authority controller; keep empty here. | |
| caBundle: "" | |
| # CA bundle will be injected by the Authority controller, so omit it here. |
| func main() { | ||
| kubeconfig := flag.String("kubeconfig", "", "Path to kubeconfig (optional)") | ||
| addr := flag.String("addr", ":8443", "Address to serve HTTPS on") | ||
| serviceName := flag.String("service-name", "example-webhook", "will be used in the DNS name for the webhook server certificate (<service-name>.<namespace>.svc)") |
There was a problem hiding this comment.
Maybe also add a flag for the CA secret name. Mainly to make the RBAC with resourceNames less magic?
There was a problem hiding this comment.
| // TLSPendingCertKey stores a pending (new) CA cert PEM while it is being | ||
| // propagated to targets. It will be promoted to `tls_serving.crt` after | ||
| // `Options.PropagationDelay` has elapsed since the rotation timestamp. | ||
| TLSPendingCertKey = corev1.TLSCertKey |
There was a problem hiding this comment.
As discussed previously, I would find it less confusing if tls.crt contained the currently used serving CA, and instead use something like tls-pending.crt for the pending one. And OFC the same for the key.
There was a problem hiding this comment.
Signed-off-by: Tim Ramlot <42113979+inteon@users.noreply.github.com>
Signed-off-by: Tim Ramlot <42113979+inteon@users.noreply.github.com>
|
@erikgb I added your outstanding comments to the TODO file. Next, I would like to figure out if we can use cert-manager as a source for our CA certificates. This will also help us make some of these design decisions. Also, I think we should try improve our test framework further. Adding even more failure/ delay points to our simulation should help us make decisions when it comes to eg. leader election. |
I spent a lot of time porting this controller to client-go and iterating on the solution.
Current state of the PR:
testing/synctestpackageYou can run
make test-smoketo test the example controller./cc @erikgb