hscontrol: structured config validation and typed listener errors#3236
hscontrol: structured config validation and typed listener errors#3236kradalby wants to merge 15 commits into
Conversation
f9d63c4 to
adeee64
Compare
There was a problem hiding this comment.
Pull request overview
This PR improves Headscale’s operator experience around configuration and listener binding by introducing structured, multi-error config validation and typed listener bind errors, including detection of internal listener collisions (notably listen_addr vs ACME HTTP-01).
Changes:
- Add structured config error reporting (
ConfigError+configValidator) that aggregates multiple validation failures usingerrors.Join. - Detect and reject TCP listener collisions at config load time (including ACME HTTP-01 listener conflicts) and document the common port-80 pitfall.
- Introduce
*ListenerBindErrorfor runtime bind failures and add CLI hints forEADDRINUSE/EACCES; move ACME HTTP-01 serving into the main errgroup and shutdown path.
Reviewed changes
Copilot reviewed 12 out of 12 changed files in this pull request and generated 3 comments.
Show a summary per file
| File | Description |
|---|---|
| hscontrol/types/listener.go | Adds ListenerBindError, port parsing helpers, and listener-collision validation. |
| hscontrol/types/listener_test.go | Unit tests for port parsing, overlap logic, wildcard detection, and ListenerBindError. |
| hscontrol/types/config.go | Refactors validation into an aggregator, adds listener-collision validation, and converts several log.Fatal validations into ConfigErrors. |
| hscontrol/types/config_test.go | Updates expectations for new structured validation output and adds coverage for collision detection and multi-error aggregation. |
| hscontrol/types/config_errors.go | Introduces structured ConfigError, ErrConfig sentinel, validator aggregation, and tree-walking helpers. |
| hscontrol/types/config_errors_test.go | Tests rendering and errors.Is/As behavior for ConfigError and joined validation errors. |
| hscontrol/app.go | Uses typed bind errors, replaces ACME HTTP-01 goroutine+log.Fatal with errgroup-managed server + shutdown integration, and refactors TLS settings return type. |
| cmd/headscale/cli/serve.go | Adds serve-error classification to append actionable hints for bind failures. |
| cmd/headscale/cli/serve_test.go | Tests serve error classification and hint rendering. |
| docs/ref/tls.md | Documents the listen_addr vs ACME HTTP-01 port-80 collision and the new startup validation behavior. |
| config-example.yaml | Adds guidance around ACME HTTP-01 listener binding and collision avoidance. |
| CHANGELOG.md | Documents the new config validation behavior and typed listener bind errors. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
8e39300 to
af6bed9
Compare
Foundation for restructuring config validation feedback. Typed error with errors.Is/As/Join hooks, structured fields rendered as a multi-line operator-facing block (current, conflicts with, allowed, minimum, maximum, why, hint, see), and a configValidator that joins violations via errors.Join. ConfigErrors walks the tree to collect every *ConfigError, and Cause keeps existing sentinel identities reachable through errors.Is. Updates juanfont#3227
portFromAddr resolves numeric and named ports (":http", ":https")
without touching /etc/services or the resolver. listenersOverlap
follows kernel rules: same port plus a wildcard host on either side,
or same port plus identical specific host, both count as collision;
different specific hosts on the same port do not. Set an explicit
viper default for tls_letsencrypt_listen so a minimal config still
resolves to ":http".
Updates juanfont#3227
Reject configurations where two configured TCP listeners would bind the same kernel socket. Covers every pair of listen_addr, grpc_listen_addr, metrics_listen_addr, and tls_letsencrypt_listen (when HTTP-01 ACME is in use). Each violation renders as a structured ConfigError naming both YAML keys, the values the operator wrote, and a hint pointing at the canonical setup. Closes the symptom in juanfont#3227: the misconfiguration is now rejected at config-load time with a self-explaining error, instead of failing at runtime with a misleading "address already in use" log because the ACME challenge listener and the main HTTP listener competed for the same port inside the same process. Fixes juanfont#3227
Convert the errorText accumulator inside validateServerConfig to the typed configValidator pattern. Each rule violation now renders as a multi-line block naming the YAML key, the value the operator wrote, and a hint pointing at the resolution. The dns.extra_records mutex that previously crashed via log.Fatal is folded into the same collector so an operator sees every problem in one go instead of fixing them one startup attempt at a time. Test wantErr assertions on validation output switch to substring match because the rendered errors are now multi-line. Updates juanfont#3227
derpConfig, databaseConfig, and dnsToTailcfgDNS used log.Fatal to reject invalid combinations the moment they were observed. Lift those checks into modular validators (validateDERPConfig, validateDatabaseConfig, validateMagicDNSConfig) called from validateServerConfig so each violation lands in the same configValidator collector and renders as a structured ConfigError next to all other config feedback. The sub-builder functions trust validation has run and no longer crash the process. Updates juanfont#3227
ListenerBindError wraps a TCP listener bind failure with the listener
name and the YAML key that drove its address. Preserves the underlying
*net.OpError via Unwrap so errors.Is(err, syscall.EADDRINUSE) keeps
working through any number of fmt.Errorf("%w") wraps. Lets the
top-level CLI classifier match listener failures with errors.As
instead of string matching.
Updates juanfont#3227
Replace the three "binding to TCP address" wraps with typed ListenerBindError values that carry the listener role, the YAML key that drove the address, and the resolved address. The wrapped error chain still walks to syscall.EADDRINUSE / EACCES so existing callers that match those sentinels keep working; the difference is that the operator now sees which listener failed instead of an unattributed "binding to TCP address" line. Updates juanfont#3227
The HTTP-01 challenge listener was launched in an orphan goroutine that called log.Fatal on bind failure, which bypassed the signal-handler shutdown path and made bind errors look identical to the main HTTP listener. Bind eagerly via net.ListenConfig in getTLSSettings, return the listener + http.Server in a tlsBundle, register the Serve loop with the existing errgroup, and call Shutdown alongside the other servers. Bind failures now surface as ListenerBindError(Listener:"ACME HTTP-01 challenge") through the normal error chain. Updates juanfont#3227
A *ListenerBindError that wraps syscall.EADDRINUSE now ends with a
"sudo ss -tlnp 'sport = :PORT'" pointer, and one wrapping
syscall.EACCES with a CAP_NET_BIND_SERVICE / setcap pointer. The
underlying chain is preserved via fmt.Errorf("%w"), so errors.Is /
errors.As continue to walk to the typed bind error and the syscall
errno.
Drop the "headscale ran into an error and had to shut down" wrap,
which only restated the symptom.
Export types.PortFromAddr so the classifier can render the port
number in the hint.
Updates juanfont#3227
af6bed9 to
6b258ee
Compare
| # | ||
| # For production: | ||
| # listen_addr: 0.0.0.0:8080 | ||
| # |
There was a problem hiding this comment.
Can we remove this section? The config file is already quite long and headscale checks and refuses to start if it detects that misconfiguration?
| # verification endpoint, and it will be listening on: | ||
| # :http = port 80 | ||
| # | ||
| # This must not collide with listen_addr, grpc_listen_addr, or |
There was a problem hiding this comment.
Same as above, I think it should be removed altogether.
66606d5 to
b87e77d
Compare
Add a callout under the HTTP-01 docs so operators see the constraint before they hit the validation error at startup. Updates juanfont#3227
deprecator.Log() called log.Fatal directly, so a single deprecated
key killed the process before any other validation rule could
report. Operators fixing one issue at a time.
Replace Log() with Apply(*configValidator). Warns still go to
log.Warn; fatals are pushed onto the validator as *ConfigError so
they merge with the rest of the config-load report. The free-form
strings collected via set.Set are gone; deprecator now stores
typed deprecation{OldKey, NewKey} records and Apply renders them
into the structured ConfigError shape.
Move the early-return PKCE check into a new validatePKCEConfig
modular validator for the same reason.
Updates juanfont#3227
Sub-builders called after validateServerConfig (prefixV4, prefixV6,
allocation strategy, dns, oidc client secret/path, isSafeServerURL)
each returned the first error. So an operator that fixed one
issue saw the next one only on the next startup.
Lift one *configValidator across the entire LoadServerConfig flow.
validateServerConfigInto(v) populates it; each sub-builder's error
is wrapped in a structured *ConfigError (with the original sentinel
kept on the Cause field, so errors.Is against errOidcMutuallyExclusive,
errServerURLSame/Suffix, ErrNoPrefixConfigured, and
ErrInvalidAllocationStrategy keeps working). v.Err() is checked once,
right before the &Config{} construction.
TestReadConfig/base-domain-in-server-url-err matched the old sentinel
wording; flipped to the new structured Reason. Added
TestLoadServerConfig_CollectsAcrossSubBuilders to lock the wiring:
four sub-builder failures from a single config produce four
*ConfigErrors in the report.
Updates juanfont#3227
Upstream tailscale.com/go.mod requires Go 1.26.3; the pinned
golang:1.26.2-alpine base refuses to fetch a newer toolchain
(GOTOOLCHAIN=local) and exits with:
go: go.mod requires go >= 1.26.3 (running go 1.26.2; GOTOOLCHAIN=local)
This blocks every PR's integration build. Bump to 1.26.3-alpine.
Same upstream go.mod >= 1.26.3 requirement broke the DERPer container build (TestDERPVerifyEndpoint and friends), in addition to the tailscale-HEAD container fixed in 513f629.
b87e77d to
211d864
Compare
listen_addr: 0.0.0.0:80+tls_letsencrypt_hostnamecollide internally — main HTTP and autocert HTTP-01 both want port 80. Second bind loses withaddress already in useand the ACME goroutinelog.Fatals. No hint that the conflict is internal.Now rejected at config load:
Real runtime bind failures get a typed
*ListenerBindErrorplus a CLI hint:errors.Is/As/Joinwork through the chain. Validation reports every violation in one pass. ACME listener moved into the main errgroup; orphanlog.Fatalis gone.Fixes #3227